Training the C&C Parser (ACL Wiki; revision of 2015-04-21 by [[User:KEvang|KEvang]])
<hr />
<div>The [http://svn.ask.it.usyd.edu.au/trac/candc C&C Parser] is an advanced statistical parser using the framework of Combinatory Categorial Grammar (CCG). It is quite easy to use with the pre-trained models, but creating your own models is another story: although the software is distributed with a wealth of scripts that should make training easy, differences between systems and dependencies on various libraries make getting the training code to work a bit daunting. The following are detailed step-by-step instructions to replicate the (almost) exact figures reported in Clark & Curran (2007)<ref>Stephen Clark and James Curran (2007): Wide-Coverage Efficient Statistical Parsing with CCG and Log-Linear Models. In <i>Computational Linguistics 33(4)</i>, http://aclweb.org/anthology-new/J/J07/J07-4004.pdf</ref> on a single '''64-bit Ubuntu 12.04''' machine (which should have multiple cores and at least about 40 GB of main memory). The steps on other recent Linux distributions should be very similar.

Please extend the instructions with more detail, helpful hints and notes on other operating systems! They were initially written up by [[User:KEvang|Kilian Evang]] based on instructions from Tim Dawborn; thanks are due to Tim and also to Stephen Clark and James Curran for advice without which I would probably never have gotten it to run.

# Customize these variables:
export CANDC_PREFIX=$HOME
export CCGBANK=$HOME/data/CCGbank1.2
export TMPDIR=$HOME/tmp # the default /tmp is often on a tiny filesystem
mkdir -p $TMPDIR        # make sure the temp directory actually exists
export NUMNODES=32
export LIB=/usr/lib

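Since the full pipeline runs for hours, it can pay off to check the hard requirements up front. The following is a minimal sketch (the function name and the exact 40 GB threshold are illustrative, not part of the C&C distribution):

```shell
# Illustrative preflight check; adjust the memory threshold to your setup.
check_prereqs() {
    # the CCGbank directory must exist
    [ -d "$1" ] || { echo "CCGbank not found at $1"; return 1; }
    # warn if the machine clearly has less than ~40 GB of RAM
    if [ -r /proc/meminfo ]; then
        mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
        [ "$mem_kb" -ge $((40 * 1024 * 1024)) ] || \
            echo "warning: only ${mem_kb} kB of RAM"
    fi
    echo "prerequisites look OK"
}
# e.g.: check_prereqs "$CCGBANK"
```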
# Some variables for use below:
export CANDC=$CANDC_PREFIX/candc
export SCRIPTS=$CANDC/src/scripts/ccg
export EXT=$CANDC/ext

# Package dependencies:
sudo apt-get install g++ gawk libibumad-dev mpich2 subversion

# Check out the C&C tools.
# You need credentials for that, see
# http://svn.ask.it.usyd.edu.au/trac/candc/wiki/Subversion
cd $CANDC_PREFIX
svn checkout http://svn.ask.it.usyd.edu.au/candc/trunk candc -r 2400

# Some patches to fix various problems with the provided scripts:

# Use a temp directory different from /tmp, since /tmp often doesn't have
# enough space:
sed -i -e "s|/tmp|$TMPDIR|" $SCRIPTS/*_model_*

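As an aside on the sed invocations used for these patches: `|` serves as the substitution delimiter because both the pattern and the replacement contain slashes. A tiny self-contained demonstration (the file name is made up):

```shell
# With "|" as the delimiter, the slashes in paths need no escaping.
TMPDIR="$HOME/tmp"
echo '/tmp/markedup.12345' | sed -e "s|/tmp|$TMPDIR|"
# prints e.g. /home/you/tmp/markedup.12345, depending on $HOME
```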
# Replace /bin/env with /usr/bin/env:
sed -i -e "s|/bin/env|/usr/bin/env|" $SCRIPTS/lexicon_features \
    $SCRIPTS/count_features

# Work around the non-portable "sed -f" shebang:
sed -i -e 's|$SCRIPTS/convert_brackets|sed -f $SCRIPTS/convert_brackets|g' \
    $SCRIPTS/create_data

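The problem this patch works around: `convert_brackets` is a sed script meant to be executed directly via a `#!/usr/bin/sed -f`-style shebang, which not every system honours, so `create_data` is rewritten to invoke `sed -f` explicitly. A sketch of the same idea with invented substitution rules (the real rules in `convert_brackets` differ):

```shell
# Stand-in for convert_brackets: run a sed script explicitly with -f
# instead of relying on its shebang line. These rules are for
# illustration only.
script=$(mktemp)
printf 's/-LRB-/(/g\ns/-RRB-/)/g\n' > "$script"
echo 'buy -LRB- NP -RRB-' | sed -f "$script"   # buy ( NP )
rm -f "$script"
```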
# TODO: patches to make the scripts work with the LDC version of CCGbank
# should go here.

# Make the ext directory:
mkdir $EXT

# Install the Boost library (Ubuntu doesn't seem to ship a version that is
# compiled against MPICH2).
echo 'using mpi ;' > ~/user-config.jam # without this, Boost's build script
                                       # won't build the MPI library
mkdir $EXT/install
cd $EXT/install
wget https://dl.dropboxusercontent.com/u/5358991/boost_1_53_0.tar.gz # or
# get it from Sourceforge
tar -xzf boost_1_53_0.tar.gz
cd boost_1_53_0
./bootstrap.sh --with-libraries=mpi --prefix=$EXT
./b2 install

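If `b2 install` succeeded, a `libboost_mpi` library should now sit under `$EXT/lib`. A small hedged check (the exact file name varies by platform and build options, hence the glob):

```shell
# Returns success iff some libboost_mpi* file exists in the given directory.
have_boost_mpi() {
    ls "$1"/libboost_mpi* >/dev/null 2>&1
}
# e.g.: have_boost_mpi "$EXT/lib" || echo "Boost.MPI build seems to have failed"
```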
# Install the old (22 Apr 2009) version of the MR-MPI library that C&C
# depends on:
cd $EXT/install
wget http://sydney.edu.au/it/~tdaw3088/misc/mrmpi-22Apr09.tbz2 # if this link
# is dead, try http://dl.dropbox.com/u/5358991/mrmpi-22Apr09.tbz2
tar -xjf mrmpi-22Apr09.tbz2
cd mrmpi-22Apr09/src
make -f Makefile.unix clean
make -f Makefile.unix
cp *.h $EXT/include
cp libmrmpi.a $EXT/lib

# Build C&C:
cd $CANDC
make -f Makefile.unix all train bin/generate

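Before continuing, it is worth confirming that the build actually produced the executables the rest of the pipeline needs. A hedged sketch: `bin/generate` comes from the make target above, while the other names are typical C&C binaries and may differ in your checkout, so adjust the list accordingly.

```shell
# Report any expected executable that the build did not produce.
check_built() {
    missing=0
    for prog in bin/generate bin/pos bin/super; do
        [ -x "$1/$prog" ] || { echo "missing: $prog"; missing=1; }
    done
    return $missing
}
# e.g.: check_built "$CANDC"
```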
# Create the training data.
# This currently only works with CCGbank 1.2, not with the LDC version of
# CCGbank:
$SCRIPTS/create_data $CCGBANK $NUMNODES working/

# Train the POS tagger and supertagger:
$SCRIPTS/train_taggers working/

# Evaluate the supertagger model to make sure its results are sane:
$SCRIPTS/cl07_table4 working/

# Create the model_hybrid directory and an empty config file:
mkdir working/model_hybrid
touch working/model_hybrid/config

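A small practical note: when re-running the pipeline, plain `mkdir` aborts if the directory already exists. A `-p` variant of the two commands above makes this step idempotent (and behaves identically on a fresh run):

```shell
# Idempotent variant: safe to re-run without an "already exists" error.
mkdir -p working/model_hybrid
touch working/model_hybrid/config
```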
# Train a hybrid model:
export LD_LIBRARY_PATH=$EXT/lib:$LIB
$SCRIPTS/create_model_hybrid `pwd` $NUMNODES working/
$SCRIPTS/train_model_hybrid `pwd` $NUMNODES working/

# Evaluate the parser model:
$SCRIPTS/cl07_table7 working/

== References ==
<references/>
</div>
<hr />
<div>The [http://svn.ask.it.usyd.edu.au/trac/candc C&C Parser] is an advanced statistical parser using the framework of Combinatory Categorial Grammar (CCG). It is quite easy to use with pre-trained models, but creating one's own models is a slightly different story. Although the software is distributed with a wealth of scripts that should make training easy, differences between systems and dependencies on various libraries make the task of getting the training code to work a bit daunting. The following are detailed step-by-step instructions to replicate the (almost) exact figures reported in Clark&Curran (2007)<ref>Stephen Clark and James Curran (2007): Wide-Coverage Efficient Statistical Parsing with CCG and Log-Linear Models. In <i>Computational Linguistics 33(4)</i>, http://aclweb.org/anthology-new/J/J07/J07-4004.pdf</ref> on a single '''64-bit Ubuntu 12.04''' machine (which should have multiple cores and at least around 40 GB of main memory). The steps to take on other recent Linux distributions should be very similar.<br />
<br />
Please extend the instructions with more detail, helpful hints and notes on other operating systems! They were initially written up by [[User:KEvang|Kilian Evang]] based on instructions from Tim Dawborn; thanks are due to Tim and also to Stephen Clark and James Curran for advice without which I would probably never have gotten it to run.<br />
<br />
# Customize these variables:<br />
export CANDC_PREFIX=$HOME<br />
export CCGBANK=$HOME/data/CCGbank1.2<br />
export TMPDIR=$HOME/tmp # the default /tmp is often on a tiny filesystem<br />
export NUMNODES=32<br />
export LIB=/usr/lib<br />
<br />
# Some variables for use below:<br />
export CANDC=$CANDC_PREFIX/candc<br />
export SCRIPTS=$CANDC/src/scripts/ccg<br />
export EXT=$CANDC/ext<br />
<br />
# Package dependencies:<br />
sudo apt-get install g++ gawk libibumad-dev mpich2 subversion<br />
<br />
# Check out the C&C tools.<br />
# You need credentials for that, see<br />
# http://svn.ask.it.usyd.edu.au/trac/candc/wiki/Subversion<br />
cd $CANDC_PREFIX<br />
svn checkout http://svn.ask.it.usyd.edu.au/candc/trunk candc -r 2400<br />
<br />
# Some patches to fix various problems with the scripts provided:<br />
<br />
# Use a temp directory different from /tmp since that often doesn't have enough<br />
# space:<br />
sed -i -e "s|/tmp|$TMPDIR|" $SCRIPTS/*_model_*<br />
<br />
# Replace /bin/env by /usr/bin/env<br />
sed -i -e "s|/bin/env|/usr/bin/env|" $SCRIPTS/lexicon_features \<br />
$SCRIPTS/count_features<br />
<br />
# Work around non-portable sed -f shebang<br />
sed -i -e 's|$SCRIPTS/convert_brackets|sed -f $SCRIPTS/convert_brackets|g' \<br />
$SCRIPTS/create_data<br />
<br />
# TODO patches to make the scripts work with the LDC version of CCGbank should<br />
# go here.<br />
<br />
# Make ext directory<br />
mkdir $EXT<br />
<br />
# Install Boost library (Ubuntu doesn't seem to have a version that is compiled<br />
# against MPICH2).<br />
echo 'using mpi ;' > ~/user-config.jam # Boost's build script won't build MPI<br />
# library without this for some reason<br />
mkdir $EXT/install<br />
cd $EXT/install<br />
wget https://dl.dropboxusercontent.com/u/5358991/boost_1_53_0.tar.gz # or<br />
# get it from Sourceforge<br />
tar -xzf boost_1_53_0.tar.gz<br />
cd boost_1_53_0<br />
./bootstrap.sh --with-libraries=mpi --prefix=$EXT<br />
./b2 install<br />
<br />
# Install ancient MR-MPI C&C depends on<br />
cd $EXT/install<br />
wget http://sydney.edu.au/it/~tdaw3088/misc/mrmpi-22Apr09.tbz2 # If this link is<br />
# dead, try http://dl.dropbox.com/u/5358991/mrmpi-22Apr09.tbz2<br />
tar jxf mrmpi-22Apr09.tbz2<br />
cd mrmpi-22Apr09/src<br />
make -f Makefile.unix clean<br />
make -f Makefile.unix<br />
cp *.h $EXT/include<br />
cp libmrmpi.a $EXT/lib<br />
<br />
# Build C&C<br />
cd $CANDC<br />
make -f Makefile.unix all train bin/generate<br />
<br />
# Create data<br />
# Will only work with CCGbank 1.2 for now, not with LDC version of CCGbank<br />
$SCRIPTS/create_data $CCGBANK $NUMNODES working/<br />
<br />
# Train the POS tagger and Supertagger:<br />
$SCRIPTS/train_taggers working/<br />
<br />
# Evaluate the supertagger model to ensure its results are sane:<br />
$SCRIPTS/cl07_table4 working/<br />
<br />
# Create the model_hybrid directory and empty config file:<br />
mkdir working/model_hybrid<br />
touch working/model_hybrid/config<br />
<br />
# Train a hybrid model:<br />
export LD_LIBRARY_PATH=$EXT/lib:$LIB<br />
$SCRIPTS/create_model_hybrid `pwd` $NUMNODES working/<br />
$SCRIPTS/train_model_hybrid `pwd` $NUMNODES working/<br />
<br />
# Evaluate the parser model:<br />
$SCRIPTS/cl07_table7 working/</div>KEvanghttps://aclweb.org/aclwiki/index.php?title=Training_the_C%26C_Parser&diff=10245Training the C&C Parser2013-09-10T10:55:08Z<p>KEvang: </p>
<hr />
<div>The [http://svn.ask.it.usyd.edu.au/trac/candc C&C Parser] is an advanced statistical parser using the framework of Combinatory Categorial Grammar (CCG). It is quite easy to use with pre-trained models, but creating one's own models is a slightly different story. Although the software is distributed with a wealth of scripts that should make training easy, differences between systems and dependencies on various libraries make the task of getting the training code to work a bit daunting. The following are detailed step-by-step instructions to replicate the (almost) exact figures reported in Clark&Curran (2007)<ref>Stephen Clark and James Curran (2007): Wide-Coverage Efficient Statistical Parsing with CCG and Log-Linear Models. In <i>Computational Linguistics 33(4)</i>, http://aclweb.org/anthology-new/J/J07/J07-4004.pdf</ref> on a single '''64-bit Ubuntu 12.04''' machine (which should have multiple cores and at least around 40 GB of main memory). The steps to take on other recent Linux distributions should be very similar.<br />
<br />
Please extend the instructions with more detail, helpful hints and notes on other operating systems! They were initially written up by [[User:KEvang|Kilian Evang]] based on instructions from Tim Dawborn; thanks are due to Tim and also to Stephen Clark and James Curran for advice without which I would probably never have gotten it to run.<br />
<br />
# Customize these variables:<br />
export CANDC_PREFIX=$HOME<br />
export CCGBANK=$HOME/data/CCGbank1.2<br />
export TMPDIR=$HOME/tmp # the default /tmp is often on a tiny filesystem<br />
export NUMNODES=32<br />
export LIB=/usr/lib<br />
<br />
# Some variables for use below:<br />
export CANDC=$CANDC_PREFIX/candc<br />
export SCRIPTS=$CANDC/src/scripts/ccg<br />
export EXT=$CANDC/ext<br />
<br />
# Package dependencies:<br />
sudo apt-get install g++ gawk libibumad-dev mpich2 subversion<br />
<br />
# Check out the C&C tools.<br />
# You need credentials for that, see<br />
# http://svn.ask.it.usyd.edu.au/trac/candc/wiki/Subversion<br />
cd $CANDC_PREFIX<br />
svn checkout http://svn.ask.it.usyd.edu.au/candc/trunk candc -r 2400<br />
<br />
# Some patches to fix various problems with the scripts provided:<br />
<br />
# Use a temp directory different from /tmp since that often doesn't have enough<br />
# space:<br />
sed -i -e "s|/tmp|$TMPDIR|" $SCRIPTS/*_model_*<br />
<br />
# Replace /bin/env by /usr/bin/env<br />
sed -i -e "s|/bin/env|/usr/bin/env|" $SCRIPTS/lexicon_features \<br />
$SCRIPTS/count_features<br />
<br />
# Work around non-portable sed -f shebang<br />
sed -i -e 's|$SCRIPTS/convert_brackets|sed -f $SCRIPTS/convert_brackets|g' \<br />
$SCRIPTS/create_data<br />
<br />
# TODO patches to make the scripts work with the LDC version of CCGbank should<br />
# go here.<br />
<br />
# Make ext directory<br />
mkdir $EXT<br />
<br />
# Install Boost library (Ubuntu doesn't seem to have a version that is compiled<br />
# against MPICH2).<br />
echo 'using mpi ;' > ~/user-config.jam # Boost's build script won't build MPI<br />
# library without this for some reason<br />
mkdir $EXT/install<br />
cd $EXT/install<br />
wget https://dl.dropboxusercontent.com/u/5358991/boost_1_53_0.tar.gz # or<br />
# get it from Sourceforge<br />
tar -xzf boost_1_53_0.tar.gz<br />
cd boost_1_53_0<br />
./bootstrap.sh --with-libraries=mpi --prefix=$EXT<br />
./b2 install<br />
<br />
# Install ancient MR-MPI C&C depends on<br />
cd $EXT/install<br />
wget http://sydney.edu.au/it/~tdaw3088/misc/mrmpi-22Apr09.tbz2 # If this link is<br />
# dead, try http://dl.dropbox.com/u/5358991/mrmpi-22Apr09.tbz2<br />
tar jxf mrmpi-22Apr09.tbz2<br />
cd mrmpi-22Apr09/src<br />
make -f Makefile.linux clean<br />
make -f Makefile.linux<br />
cp *.h $EXT/include<br />
cp libmrmpi.a $EXT/lib<br />
<br />
# Build C&C<br />
cd $CANDC<br />
make -f Makefile.linux all train bin/generate<br />
<br />
# Create data<br />
# Will only work with CCGbank 1.2 for now, not with LDC version of CCGbank<br />
$SCRIPTS/create_data $CCGBANK $NUMNODES working/<br />
<br />
# Train the POS tagger and Supertagger:<br />
$SCRIPTS/train_taggers working/<br />
<br />
# Evaluate the supertagger model to ensure its results are sane:<br />
$SCRIPTS/cl07_table4 working/<br />
<br />
# Create the model_hybrid directory and empty config file:<br />
mkdir working/model_hybrid<br />
touch working/model_hybrid/config<br />
<br />
# Train a hybrid model:<br />
export LD_LIBRARY_PATH=$EXT/lib:$LIB<br />
$SCRIPTS/create_model_hybrid `pwd` $NUMNODES working/<br />
$SCRIPTS/train_model_hybrid `pwd` $NUMNODES working/<br />
<br />
# Evaluate the parser model:<br />
$SCRIPTS/cl07_table7 working/</div>KEvanghttps://aclweb.org/aclwiki/index.php?title=Training_the_C%26C_Parser&diff=10244Training the C&C Parser2013-09-10T10:54:11Z<p>KEvang: </p>
<hr />
<div>The [http://svn.ask.it.usyd.edu.au/trac/candc C&C Parser] is an advanced statistical parser using the framework of Combinatory Categorial Grammar (CCG). It is quite easy to use with pre-trained models, but creating one's own models is a slightly different story. Although the software is distributed with a wealth of scripts that should make training easy, differences between systems and dependencies on various libraries make the task of getting the training code to work a bit daunting. The following are detailed step-by-step instructions to replicate the (almost) exact figures reported in Clark&Curran (2007)<ref>Stephen Clark and James Curran (2007): Wide-Coverage Efficient Statistical Parsing with CCG and Log-Linear Models. In <i>Computational Linguistics 33(4)</i>, http://aclweb.org/anthology-new/J/J07/J07-4004.pdf</ref> on a single '''64-bit Ubuntu 12.04''' machine (which should have multiple cores and at least around 40 GB of main memory). The steps to take on other recent Linux distributions should be very similar.<br />
<br />
Please extend the instructions with more detail, helpful hints and notes on other operating systems! They were initially written up by [[User:KEvang|Kilian Evang]]; thanks are due to Tim Dawborn, Stephen Clark and James Curran for advice without which I would probably never have gotten it to run.<br />
<br />
# Customize these variables:<br />
export CANDC_PREFIX=$HOME<br />
export CCGBANK=$HOME/data/CCGbank1.2<br />
export TMPDIR=$HOME/tmp # the default /tmp is often on a tiny filesystem<br />
export NUMNODES=32<br />
export LIB=/usr/lib<br />
<br />
# Some variables for use below:<br />
export CANDC=$CANDC_PREFIX/candc<br />
export SCRIPTS=$CANDC/src/scripts/ccg<br />
export EXT=$CANDC/ext<br />
<br />
# Package dependencies:<br />
sudo apt-get install g++ gawk libibumad-dev mpich2 subversion<br />
<br />
# Check out the C&C tools.<br />
# You need credentials for that, see<br />
# http://svn.ask.it.usyd.edu.au/trac/candc/wiki/Subversion<br />
cd $CANDC_PREFIX<br />
svn checkout http://svn.ask.it.usyd.edu.au/candc/trunk candc -r 2400<br />
<br />
# Some patches to fix various problems with the scripts provided:<br />
<br />
# Use a temp directory different from /tmp since that often doesn't have enough<br />
# space:<br />
sed -i -e "s|/tmp|$TMPDIR|" $SCRIPTS/*_model_*<br />
<br />
# Replace /bin/env by /usr/bin/env<br />
sed -i -e "s|/bin/env|/usr/bin/env|" $SCRIPTS/lexicon_features \<br />
$SCRIPTS/count_features<br />
<br />
# Work around non-portable sed -f shebang<br />
sed -i -e 's|$SCRIPTS/convert_brackets|sed -f $SCRIPTS/convert_brackets|g' \<br />
$SCRIPTS/create_data<br />
<br />
# TODO patches to make the scripts work with the LDC version of CCGbank should<br />
# go here.<br />
<br />
# Make ext directory<br />
mkdir $EXT<br />
<br />
# Install Boost library (Ubuntu doesn't seem to have a version that is compiled<br />
# against MPICH2).<br />
echo 'using mpi ;' > ~/user-config.jam # Boost's build script won't build MPI<br />
# library without this for some reason<br />
mkdir $EXT/install<br />
cd $EXT/install<br />
wget https://dl.dropboxusercontent.com/u/5358991/boost_1_53_0.tar.gz # or<br />
# get it from Sourceforge<br />
tar -xzf boost_1_53_0.tar.gz<br />
cd boost_1_53_0<br />
./bootstrap.sh --with-libraries=mpi --prefix=$EXT<br />
./b2 install<br />
<br />
# Install ancient MR-MPI C&C depends on<br />
cd $EXT/install<br />
wget http://sydney.edu.au/it/~tdaw3088/misc/mrmpi-22Apr09.tbz2 # If this link is<br />
# dead, try http://dl.dropbox.com/u/5358991/mrmpi-22Apr09.tbz2<br />
tar jxf mrmpi-22Apr09.tbz2<br />
cd mrmpi-22Apr09/src<br />
make -f Makefile.linux clean<br />
make -f Makefile.linux<br />
cp *.h $EXT/include<br />
cp libmrmpi.a $EXT/lib<br />
<br />
# Build C&C<br />
cd $CANDC<br />
make -f Makefile.linux all train bin/generate<br />
<br />
# Create data<br />
# Will only work with CCGbank 1.2 for now, not with LDC version of CCGbank<br />
$SCRIPTS/create_data $CCGBANK $NUMNODES working/<br />
<br />
# Train the POS tagger and Supertagger:<br />
$SCRIPTS/train_taggers working/<br />
<br />
# Evaluate the supertagger model to ensure its results are sane:<br />
$SCRIPTS/cl07_table4 working/<br />
<br />
# Create the model_hybrid directory and empty config file:<br />
mkdir working/model_hybrid<br />
touch working/model_hybrid/config<br />
<br />
# Train a hybrid model:<br />
export LD_LIBRARY_PATH=$EXT/lib:$LIB<br />
$SCRIPTS/create_model_hybrid `pwd` $NUMNODES working/<br />
$SCRIPTS/train_model_hybrid `pwd` $NUMNODES working/<br />
<br />
# Evaluate the parser model:<br />
$SCRIPTS/cl07_table7 working/</div>KEvanghttps://aclweb.org/aclwiki/index.php?title=Training_the_C%26C_Parser&diff=10243Training the C&C Parser2013-09-10T10:46:51Z<p>KEvang: typo</p>
<hr />
<div>The [http://svn.ask.it.usyd.edu.au/trac/candc C&C Parser] is an advanced statistical parser using the framework of Combinatory Categorial Grammar (CCG). It is quite easy to use with pre-trained models, but creating one's own models is a slightly different story. Although the software is distributed with a wealth of scripts that should make training easy, differences between systems and dependencies on various libraries make the task of getting the training code to work a bit daunting. The following are detailed step-by-step instructions to replicate the (almost) exact figures reported in Clark&Curran (2007)<ref>Stephen Clark and James Curran (2007): Wide-Coverage Efficient Statistical Parsing with CCG and Log-Linear Models. In <i>Computational Linguistics 33(4)</i>, http://aclweb.org/anthology-new/J/J07/J07-4004.pdf</ref> on a single '''64-bit Ubuntu 12.04''' machine. The steps to take on other recent Linux distributions should be very similar.<br />
<br />
Please extend the instructions with more detail, helpful hints and notes on other operating systems! They were initially written up by [[User:KEvang|Kilian Evang]]; thanks are due to Tim Dawborn, Stephen Clark and James Curran for advice without which I would probably never have gotten it to run.<br />
<br />
# Customize these variables:<br />
export CANDC_PREFIX=$HOME<br />
export CCGBANK=$HOME/data/CCGbank1.2<br />
export TMPDIR=$HOME/tmp # the default /tmp is often on a tiny filesystem<br />
export NUMNODES=32<br />
export LIB=/usr/lib<br />
<br />
# Some variables for use below:<br />
export CANDC=$CANDC_PREFIX/candc<br />
export SCRIPTS=$CANDC/src/scripts/ccg<br />
export EXT=$CANDC/ext<br />
<br />
# Package dependencies:<br />
sudo apt-get install g++ gawk libibumad-dev mpich2 subversion<br />
<br />
# Check out the C&C tools.<br />
# You need credentials for that, see<br />
# http://svn.ask.it.usyd.edu.au/trac/candc/wiki/Subversion<br />
cd $CANDC_PREFIX<br />
svn checkout http://svn.ask.it.usyd.edu.au/candc/trunk candc -r 2400<br />
<br />
# Some patches to fix various problems with the scripts provided:<br />
<br />
# Use a temp directory different from /tmp since that often doesn't have enough<br />
# space:<br />
sed -i -e "s|/tmp|$TMPDIR|" $SCRIPTS/*_model_*<br />
<br />
# Replace /bin/env by /usr/bin/env<br />
sed -i -e "s|/bin/env|/usr/bin/env|" $SCRIPTS/lexicon_features \<br />
$SCRIPTS/count_features<br />
<br />
# Work around non-portable sed -f shebang<br />
sed -i -e 's|$SCRIPTS/convert_brackets|sed -f $SCRIPTS/convert_brackets|g' \<br />
$SCRIPTS/create_data<br />
<br />
# TODO patches to make the scripts work with the LDC version of CCGbank should<br />
# go here.<br />
<br />
# Make ext directory<br />
mkdir $EXT<br />
<br />
# Install Boost library (Ubuntu doesn't seem to have a version that is compiled<br />
# against MPICH2).<br />
echo 'using mpi ;' > ~/user-config.jam # Boost's build script won't build MPI<br />
# library without this for some reason<br />
mkdir $EXT/install<br />
cd $EXT/install<br />
wget https://dl.dropboxusercontent.com/u/5358991/boost_1_53_0.tar.gz # or<br />
# get it from Sourceforge<br />
tar -xzf boost_1_53_0.tar.gz<br />
cd boost_1_53_0<br />
./bootstrap.sh --with-libraries=mpi --prefix=$EXT<br />
./b2 install<br />
<br />
# Install ancient MR-MPI C&C depends on<br />
cd $EXT/install<br />
wget http://sydney.edu.au/it/~tdaw3088/misc/mrmpi-22Apr09.tbz2 # If this link is<br />
# dead, try http://dl.dropbox.com/u/5358991/mrmpi-22Apr09.tbz2<br />
tar jxf mrmpi-22Apr09.tbz2<br />
cd mrmpi-22Apr09/src<br />
make -f Makefile.linux clean<br />
make -f Makefile.linux<br />
cp *.h $EXT/include<br />
cp libmrmpi.a $EXT/lib<br />
<br />
# Build C&C<br />
cd $CANDC<br />
make -f Makefile.linux all train bin/generate<br />
<br />
# Create data<br />
# Will only work with CCGbank 1.2 for now, not with LDC version of CCGbank<br />
$SCRIPTS/create_data $CCGBANK $NUMNODES working/<br />
<br />
# Train the POS tagger and Supertagger:<br />
$SCRIPTS/train_taggers working/<br />
<br />
# Evaluate the supertagger model to ensure its results are sane:<br />
$SCRIPTS/cl07_table4 working/<br />
<br />
# Create the model_hybrid directory and empty config file:<br />
mkdir working/model_hybrid<br />
touch working/model_hybrid/config<br />
<br />
# Train a hybrid model:<br />
export LD_LIBRARY_PATH=$EXT/lib:$LIB<br />
$SCRIPTS/create_model_hybrid `pwd` $NUMNODES working/<br />
$SCRIPTS/train_model_hybrid `pwd` $NUMNODES working/<br />
<br />
# Evaluate the parser model:<br />
$SCRIPTS/cl07_table7 working/</div>KEvanghttps://aclweb.org/aclwiki/index.php?title=Training_the_C%26C_Parser&diff=10242Training the C&C Parser2013-09-09T15:45:16Z<p>KEvang: </p>
<hr />
<div>The [http://svn.ask.it.usyd.edu.au/trac/candc C&C Parser] is an advanced statistical parser using the framework of Combinatory Categorial Grammar (C&C). It is quite easy to use with pre-trained models, but creating one's own models is a slightly different story. Although the software is distributed with a wealth of scripts that should make training easy, differences between systems and dependencies on various libraries make the task of getting the training code to work a bit daunting. The following are detailed step-by-step instructions to replicate the (almost) exact figures reported in Clark&Curran (2007)<ref>Stephen Clark and James Curran (2007): Wide-Coverage Efficient Statistical Parsing with CCG and Log-Linear Models. In <i>Computational Linguistics 33(4)</i>, http://aclweb.org/anthology-new/J/J07/J07-4004.pdf</ref> on a single '''64-bit Ubuntu 12.04''' machine. The steps to take on other recent Linux distributions should be very similar.<br />
<br />
Please extend the instructions with more detail, helpful hints and notes on other operating systems! They were initially written up by [[User:KEvang|Kilian Evang]]; thanks are due to Tim Dawborn, Stephen Clark and James Curran for advice without which I would probably never have gotten it to run.<br />
<br />
# Customize these variables:<br />
export CANDC_PREFIX=$HOME<br />
export CCGBANK=$HOME/data/CCGbank1.2<br />
export TMPDIR=$HOME/tmp # the default /tmp is often on a tiny filesystem<br />
export NUMNODES=32<br />
export LIB=/usr/lib<br />
<br />
# Some variables for use below:<br />
export CANDC=$CANDC_PREFIX/candc<br />
export SCRIPTS=$CANDC/src/scripts/ccg<br />
export EXT=$CANDC/ext<br />
<br />
# Package dependencies:<br />
sudo apt-get install g++ gawk libibumad-dev mpich2 subversion<br />
<br />
# Check out the C&C tools.<br />
# You need credentials for that, see<br />
# http://svn.ask.it.usyd.edu.au/trac/candc/wiki/Subversion<br />
cd $CANDC_PREFIX<br />
svn checkout http://svn.ask.it.usyd.edu.au/candc/trunk candc -r 2400<br />
<br />
# Some patches to fix various problems with the scripts provided:<br />
<br />
# Use a temp directory different from /tmp since that often doesn't have enough<br />
# space:<br />
sed -i -e "s|/tmp|$TMPDIR|" $SCRIPTS/*_model_*<br />
<br />
# Replace /bin/env by /usr/bin/env<br />
sed -i -e "s|/bin/env|/usr/bin/env|" $SCRIPTS/lexicon_features \<br />
$SCRIPTS/count_features<br />
<br />
# Work around non-portable sed -f shebang<br />
sed -i -e 's|$SCRIPTS/convert_brackets|sed -f $SCRIPTS/convert_brackets|g' \<br />
$SCRIPTS/create_data<br />
<br />
# TODO patches to make the scripts work with the LDC version of CCGbank should<br />
# go here.<br />
<br />
# Make ext directory<br />
mkdir $EXT<br />
<br />
# Install Boost library (Ubuntu doesn't seem to have a version that is compiled<br />
# against MPICH2).<br />
echo 'using mpi ;' > ~/user-config.jam # Boost's build script won't build MPI<br />
# library without this for some reason<br />
mkdir $EXT/install<br />
cd $EXT/install<br />
wget https://dl.dropboxusercontent.com/u/5358991/boost_1_53_0.tar.gz # or<br />
# get it from Sourceforge<br />
tar -xzf boost_1_53_0.tar.gz<br />
cd boost_1_53_0<br />
./bootstrap.sh --with-libraries=mpi --prefix=$EXT<br />
./b2 install<br />
<br />
# Install ancient MR-MPI C&C depends on<br />
cd $EXT/install<br />
wget http://sydney.edu.au/it/~tdaw3088/misc/mrmpi-22Apr09.tbz2 # If this link is<br />
# dead, try http://dl.dropbox.com/u/5358991/mrmpi-22Apr09.tbz2<br />
tar jxf mrmpi-22Apr09.tbz2<br />
cd mrmpi-22Apr09/src<br />
make -f Makefile.linux clean<br />
make -f Makefile.linux<br />
cp *.h $EXT/include<br />
cp libmrmpi.a $EXT/lib<br />
<br />
# Build C&C<br />
cd $CANDC<br />
make -f Makefile.linux all train bin/generate<br />
<br />
# Create data<br />
# Will only work with CCGbank 1.2 for now, not with LDC version of CCGbank<br />
$SCRIPTS/create_data $CCGBANK $NUMNODES working/<br />
<br />
# Train the POS tagger and Supertagger:<br />
$SCRIPTS/train_taggers working/<br />
<br />
# Evaluate the supertagger model to ensure its results are sane:<br />
$SCRIPTS/cl07_table4 working/<br />
<br />
# Create the model_hybrid directory and empty config file:<br />
mkdir working/model_hybrid<br />
touch working/model_hybrid/config<br />
<br />
# Train a hybrid model:<br />
export LD_LIBRARY_PATH=$EXT/lib:$LIB<br />
$SCRIPTS/create_model_hybrid `pwd` $NUMNODES working/<br />
$SCRIPTS/train_model_hybrid `pwd` $NUMNODES working/<br />
<br />
# Evaluate the parser model:<br />
$SCRIPTS/cl07_table7 working/</div>KEvanghttps://aclweb.org/aclwiki/index.php?title=Training_the_C%26C_Parser&diff=10241Training the C&C Parser2013-09-09T15:44:28Z<p>KEvang: </p>
<hr />
<div>The [http://svn.ask.it.usyd.edu.au/trac/candc C&C Parser] is an advanced statistical parser using the framework of Combinatory Categorial Grammar (C&C). It is quite easy to use with pre-trained models, but creating one's own models is a slightly different story. Although the software is distributed with a wealth of scripts that should make training easy, differences between systems and dependencies on various libraries make the task of getting the training code to work a bit daunting. The following are detailed step-by-step instructions to replicate the (almost) exact figures reported in Clark&Curran (2007)<ref>Stephen Clark and James Curran (2007): Wide-Coverage Efficient Statistical Parsing with CCG and Log-Linear Models. In <i>Computational Linguistics 33(4)</i>, http://aclweb.org/anthology-new/J/J07/J07-4004.pdf</ref> on a single '''64-bit Ubuntu 12.04''' machine. The steps to take on other recent Linux distributions should be very similar.<br />
<br />
Please extend the instructions with more detail, helpful hints and notes on other operating systems! They were initially written up by [[User:KEvang|Kilian Evang]]; thanks are due to Tim Dawborn, Stephen Clark and James Curran for advice without which I would probably never have gotten it to run.<br />
<br />
# Customize these variables:<br />
export CANDC_PREFIX=$HOME<br />
export CCGBANK=$HOME/data/CCGbank1.2 # <br />
export TMPDIR=$HOME/tmp # the default /tmp is often on a tiny filesystem<br />
export NUMNODES=32<br />
export LIB=/usr/lib<br />
<br />
# Some variables for use below:<br />
export CANDC=$CANDC_PREFIX/candc<br />
export SCRIPTS=$CANDC/src/scripts/ccg<br />
export EXT=$CANDC/ext<br />
<br />
# Package dependencies:<br />
sudo apt-get install g++ gawk libibumad-dev mpich2 subversion<br />
<br />
# Check out the C&C tools.<br />
# You need credentials for that, see<br />
# http://svn.ask.it.usyd.edu.au/trac/candc/wiki/Subversion<br />
cd $CANDC_PREFIX<br />
svn checkout http://svn.ask.it.usyd.edu.au/candc/trunk candc -r 2400<br />
<br />
# Some patches to fix various problems with the scripts provided:<br />
<br />
# Use a temp directory different from /tmp since that often doesn't have enough<br />
# space:<br />
sed -i -e "s|/tmp|$TMPDIR|" $SCRIPTS/*_model_*<br />
<br />
# Replace /bin/env by /usr/bin/env<br />
sed -i -e "s|/bin/env|/usr/bin/env|" $SCRIPTS/lexicon_features \<br />
$SCRIPTS/count_features<br />
<br />
# Work around non-portable sed -f shebang<br />
sed -i -e 's|$SCRIPTS/convert_brackets|sed -f $SCRIPTS/convert_brackets|g' \<br />
$SCRIPTS/create_data<br />
<br />
# TODO patches to make the scripts work with the LDC version of CCGbank should<br />
# go here.<br />
<br />
# Make ext directory<br />
mkdir $EXT<br />
<br />
# Install Boost library (Ubuntu doesn't seem to have a version that is compiled<br />
# against MPICH2).<br />
echo 'using mpi ;' > ~/user-config.jam # Boost's build script won't build MPI<br />
# library without this for some reason<br />
mkdir $EXT/install<br />
cd $EXT/install<br />
wget https://dl.dropboxusercontent.com/u/5358991/boost_1_53_0.tar.gz # or<br />
# get it from Sourceforge<br />
tar -xzf boost_1_53_0.tar.gz<br />
cd boost_1_53_0<br />
./bootstrap.sh --with-libraries=mpi --prefix=$EXT<br />
./b2 install<br />
<br />
# Install the ancient MR-MPI release that C&C depends on<br />
cd $EXT/install<br />
wget http://sydney.edu.au/it/~tdaw3088/misc/mrmpi-22Apr09.tbz2 # If this link is<br />
# dead, try http://dl.dropbox.com/u/5358991/mrmpi-22Apr09.tbz2<br />
tar jxf mrmpi-22Apr09.tbz2<br />
cd mrmpi-22Apr09/src<br />
make -f Makefile.linux clean<br />
make -f Makefile.linux<br />
cp *.h $EXT/include<br />
cp libmrmpi.a $EXT/lib<br />
<br />
# Build C&C<br />
cd $CANDC<br />
make -f Makefile.linux all train bin/generate<br />
<br />
# Create data<br />
# Will only work with CCGbank 1.2 for now, not with LDC version of CCGbank<br />
$SCRIPTS/create_data $CCGBANK $NUMNODES working/<br />
<br />
# Train the POS tagger and supertagger:<br />
$SCRIPTS/train_taggers working/<br />
<br />
# Evaluate the supertagger model to ensure its results are sane:<br />
$SCRIPTS/cl07_table4 working/<br />
<br />
# Create the model_hybrid directory and empty config file:<br />
mkdir working/model_hybrid<br />
touch working/model_hybrid/config<br />
<br />
# Train a hybrid model:<br />
export LD_LIBRARY_PATH=$EXT/lib:$LIB<br />
$SCRIPTS/create_model_hybrid `pwd` $NUMNODES working/<br />
$SCRIPTS/train_model_hybrid `pwd` $NUMNODES working/<br />
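`LD_LIBRARY_PATH` is a colon-separated list of directories that the dynamic linker searches before the default locations; the export above puts the freshly built Boost and MR-MPI libraries in front of `$LIB`. A generic sketch of the prepend pattern (the `prepend_path` helper and paths are illustrative, not part of the C&C scripts):

```shell
# Illustrative only: prepend a directory to a colon-separated search
# path without clobbering whatever was there before.
prepend_path () {
  # $1 = directory to add, $2 = current value (may be empty)
  if [ -n "$2" ]; then
    echo "$1:$2"
  else
    echo "$1"
  fi
}

demo_path=$(prepend_path /opt/ext/lib /usr/lib)
echo "$demo_path"
```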
<br />
# Evaluate the parser model:<br />
$SCRIPTS/cl07_table7 working/</div>KEvanghttps://aclweb.org/aclwiki/index.php?title=Research&diff=10239Research2013-09-09T14:52:52Z<p>KEvang: </p>
<hr />
<div>This page is a list of links to information on research in Computational Linguistics.<br />
<br />
* [http://www.aclweb.org/anthology ACL Anthology] - more than 10,000 CL papers<br />
* [[Bibliographies]]<br />
* [[Books]]<br />
* [[Formalisms]]<br />
* [[Papers]]<br />
* [[Resources]]<br />
* [[Searching for papers]]<br />
* [[Wikipedia articles]] - on topics related to Computational Linguistics<br />
<br />
== ACL Wiki articles and tutorials ==<br />
Write your own article or tutorial!<br />
<!-- Please keep this list in alphabetical order --><br />
* [[Active Learning for NLP]] (stub)<br />
* [[Computational Lexicology]]<br />
* [[Computational Morphology]] (stub)<br />
* [[Computational Phonology]]<br />
* [[Computational Semantics]]<br />
* [[Computational Syntax]]<br />
* [[Constrained Conditional Model]] (stub)<br />
* [[Dialectometrics]]<br />
* [[Dialogue Systems]] (stub)<br />
* [[Distributional Hypothesis]]<br />
* [[Graph Based Methods]] (stub)<br />
* [[Information Extraction]] (stub)<br />
* [[Lexical Acquisition]] (stub)<br />
* [[Machine Translation]] (stub)<br />
* [[Multiword Expressions]] (stub)<br />
* [[Natural Language Generation Portal]]<br />
* [[Natural Language Understanding]] (redirect)<br />
* [[Parsing]] (stub)<br />
* [[Part-of-speech tagging]]<br />
* [[Question Answering]]<br />
* [[Semantics]] (stub)<br />
* [[Speech Processing]]<br />
* [[Statistical Semantics]]<br />
* [[Text Categorization]]<br />
* [[Text Summarization]] (stub)<br />
* [[Textual Entailment]]<br />
* [[Training the C&C Parser]]<br />
* [[Word Sense Disambiguation]]<br />
<!-- Please keep this list in alphabetical order --><br />
<br />
[[Category:Research|*]]</div>KEvang