ACL Wiki - User contributions [en]

Training the C&C Parser

2015-04-21T12:27:49Z

KEvang:

The [http://svn.ask.it.usyd.edu.au/trac/candc C&C Parser] is an advanced statistical parser using the framework of Combinatory Categorial Grammar (CCG). It is quite easy to use with pre-trained models, but creating one's own models is a different story. Although the software is distributed with a wealth of scripts that should make training easy, differences between systems and dependencies on various libraries make getting the training code to work a bit daunting. The following detailed step-by-step instructions replicate almost exactly the figures reported in Clark & Curran (2007)<ref>Stephen Clark and James Curran (2007): Wide-Coverage Efficient Statistical Parsing with CCG and Log-Linear Models. In Computational Linguistics 33(4), http://aclweb.org/anthology-new/J/J07/J07-4004.pdf</ref> on a single '''64-bit Ubuntu 12.04''' machine, which should have multiple cores and roughly 40 GB or more of main memory. The steps to take on other recent Linux distributions should be very similar.

Please extend the instructions with more detail, helpful hints and notes on other operating systems! They were initially written up by [[User:KEvang|Kilian Evang]] based on instructions from Tim Dawborn; thanks are due to Tim and also to Stephen Clark and James Curran for advice without which I would probably never have gotten it to run.
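
Before starting, it may be worth checking that the machine comes close to the memory requirement mentioned above. The following is only a minimal sketch: the helper names <code>mem_total_gb</code> and <code>enough_memory</code> are hypothetical and not part of the C&C distribution, and the check assumes a Linux-style <code>/proc/meminfo</code>. The value is rounded down to whole GiB, so a machine sold as "40 GB" may report slightly less.

```shell
# Hypothetical helpers, not part of the C&C tools.

# Print MemTotal from a meminfo-style file (default: /proc/meminfo),
# converted from kB to whole GB and rounded down.
mem_total_gb() {
    awk '/^MemTotal:/ { print int($2 / 1024 / 1024) }' "${1:-/proc/meminfo}"
}

# Succeed if the machine has at least $1 GB of memory (default: 40).
# $2 optionally names a meminfo-style file to read instead of /proc/meminfo.
enough_memory() {
    [ "$(mem_total_gb "${2:-/proc/meminfo}")" -ge "${1:-40}" ]
}

# Report what this machine has; the check is skipped where /proc/meminfo
# is not available.
if [ -r /proc/meminfo ]; then
    echo "memory: $(mem_total_gb) GB"
    enough_memory 40 || echo "warning: less than 40 GB of main memory" >&2
fi
```

The number of cores can be checked with <code>nproc</code> where available; how it should relate to <code>NUMNODES</code> below is not something this sketch tries to decide.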

# Customize these variables:
export CANDC_PREFIX=$HOME
export CCGBANK=$HOME/data/CCGbank1.2
export TMPDIR=$HOME/tmp # the default /tmp is often on a tiny filesystem
export NUMNODES=32
export LIB=/usr/lib
mkdir -p $TMPDIR # create the temp directory if it does not exist yet

# Some variables for use below:
export CANDC=$CANDC_PREFIX/candc
export SCRIPTS=$CANDC/src/scripts/ccg
export EXT=$CANDC/ext

# Package dependencies:
sudo apt-get install g++ gawk libibumad-dev mpich2 subversion

# Check out the C&C tools.
# You need credentials for that, see
# http://svn.ask.it.usyd.edu.au/trac/candc/wiki/Subversion
cd $CANDC_PREFIX
svn checkout http://svn.ask.it.usyd.edu.au/candc/trunk candc -r 2400

# Some patches to fix various problems with the scripts provided:

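# Use a temp directory different from /tmp since that often doesn't have enough
# space. Run this only once: if $TMPDIR itself contains "/tmp" (as the default
# $HOME/tmp does), a second run would rewrite the already patched paths.
sed -i -e "s|/tmp|$TMPDIR|" $SCRIPTS/*_model_*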

# Replace /bin/env with /usr/bin/env
sed -i -e "s|/bin/env|/usr/bin/env|" $SCRIPTS/lexicon_features \
$SCRIPTS/count_features

# Work around non-portable sed -f shebang
sed -i -e 's|$SCRIPTS/convert_brackets|sed -f $SCRIPTS/convert_brackets|g' \
$SCRIPTS/create_data

# TODO patches to make the scripts work with the LDC version of CCGbank should
# go here.

# Make ext directory
mkdir -p $EXT

# Install Boost library (Ubuntu doesn't seem to have a version that is compiled
# against MPICH2). Boost's build script won't build the MPI library unless
# ~/user-config.jam contains the following line, for some reason. Note that
# this overwrites any existing ~/user-config.jam.
echo 'using mpi ;' > ~/user-config.jam
mkdir $EXT/install
cd $EXT/install
# Download Boost 1.53.0 (or get it from SourceForge)
wget https://dl.dropboxusercontent.com/u/5358991/boost_1_53_0.tar.gz
tar -xzf boost_1_53_0.tar.gz
cd boost_1_53_0
./bootstrap.sh --with-libraries=mpi --prefix=$EXT
./b2 install

# Install the old MR-MPI release that C&C depends on
cd $EXT/install
# If this link is dead, try http://dl.dropbox.com/u/5358991/mrmpi-22Apr09.tbz2
wget http://sydney.edu.au/it/~tdaw3088/misc/mrmpi-22Apr09.tbz2
tar jxf mrmpi-22Apr09.tbz2
cd mrmpi-22Apr09/src
make -f Makefile.unix clean
make -f Makefile.unix
cp *.h $EXT/include
cp libmrmpi.a $EXT/lib

# Build C&C
cd $CANDC
make -f Makefile.unix all train bin/generate

# Create data
# Will only work with CCGbank 1.2 for now, not with the LDC version of CCGbank
$SCRIPTS/create_data $CCGBANK $NUMNODES working/

# Train the POS tagger and the supertagger:
$SCRIPTS/train_taggers working/

# Evaluate the supertagger model to ensure its results are sane:
$SCRIPTS/cl07_table4 working/

# Create the model_hybrid directory and empty config file:
mkdir working/model_hybrid
touch working/model_hybrid/config

# Train a hybrid model:
export LD_LIBRARY_PATH=$EXT/lib:$LIB
$SCRIPTS/create_model_hybrid "$(pwd)" $NUMNODES working/
$SCRIPTS/train_model_hybrid "$(pwd)" $NUMNODES working/

# Evaluate the parser model:
$SCRIPTS/cl07_table7 working/
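
One thing to watch when re-running the steps above: the <code>/tmp</code> patch is not idempotent when the replacement directory itself contains <code>/tmp</code> (as the default <code>$HOME/tmp</code> does), because a second run rewrites the already patched path. The following sketch guards against that. The function name <code>patch_tmp_once</code> is hypothetical, GNU <code>sed -i</code> is assumed as in the script above, and treating any file that already mentions the replacement directory as patched is a simplification.

```shell
# Hypothetical helper, not part of the C&C tools.
# Apply the /tmp patch to a file unless it already mentions the new directory.
#   $1 = file to patch
#   $2 = replacement directory (for example "$TMPDIR")
patch_tmp_once() {
    if grep -qF -- "$2" "$1"; then
        echo "already patched: $1"
    else
        # Same substitution as in the instructions: first /tmp on each line.
        sed -i -e "s|/tmp|$2|" "$1"
    fi
}
```

It could be used in place of the plain <code>sed</code> call, for example as <code>for f in $SCRIPTS/*_model_*; do patch_tmp_once "$f" "$TMPDIR"; done</code>.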

== References ==

Training the C&C Parser

2015-04-21T12:27:34Z

KEvang:

Training the C&C Parser

2013-09-10T10:55:08Z

KEvang:

Training the C&C Parser

2013-09-10T10:54:11Z

KEvang:

Training the C&C Parser

2013-09-10T10:46:51Z

KEvang: typo

Training the C&C Parser

2013-09-09T15:45:16Z

KEvang:

Training the C&C Parser

2013-09-09T15:44:28Z

KEvang:

Training the C&C Parser

2013-09-09T15:39:12Z

KEvang: Created page with "The [http://svn.ask.it.usyd.edu.au/trac/candc C&C Parser] is an advanced statistical parser using the framework of Combinatory Categorial Grammar (C&C). It is quite easy to us..."

Research

2013-09-09T14:52:52Z

KEvang:

This page is a list of links to information on research in Computational Linguistics.

* [http://www.aclweb.org/anthology ACL Anthology] - more than 10,000 CL papers
* [[Bibliographies]]
* [[Books]]
* [[Formalisms]]
* [[Papers]]
* [[Resources]]
* [[Searching for papers]]
* [[Wikipedia articles]] - on topics related to Computational Linguistics

== ACL Wiki articles and tutorials ==
Write your own article or tutorial!

* [[Active Learning for NLP]] (stub)
* [[Computational Lexicology]]
* [[Computational Morphology]] (stub)
* [[Computational Phonology]]
* [[Computational Semantics]]
* [[Computational Syntax]]
* [[Constrained Conditional Model]] (stub)
* [[Dialectometrics]]
* [[Dialogue Systems]] (stub)
* [[Distributional Hypothesis]]
* [[Graph Based Methods]] (stub)
* [[Information Extraction]] (stub)
* [[Lexical Acquisition]] (stub)
* [[Machine Translation]] (stub)
* [[Natural Language Generation Portal]]
* [[Natural Language Understanding]] (redirect)
* [[Multiword Expressions]] (stub)
* [[Parsing]] (stub)
* [[Part-of-speech tagging]]
* [[Question Answering]]
* [[Semantics]] (stub)
* [[Speech Processing]]
* [[Statistical Semantics]]
* [[Text Categorization]]
* [[Textual Entailment]]
* [[Text Summarization]] (stub)
* [[Training the C&C Parser]]
* [[Word Sense Disambiguation]]


[[Category:Research|*]]