Building and Using Comparable Corpora

Event Notification Type: 
Call for Papers
Abbreviated Title: 
BUCC'17
Location: 
ACL'17
Thursday, 3 August 2017
Country: 
Canada
City: 
Vancouver
Contact: 
Serge Sharoff
Pierre Zweigenbaum
Reinhard Rapp
Submission Deadline: 
Friday, 21 April 2017

10th Workshop on Building and Using Comparable Corpora
Shared task: detection of parallel sentences in comparable corpora

Important dates
Workshop Submission deadline: 21 April, 2017
Workshop Notification: 19 May, 2017
Workshop Camera Ready: 26 May, 2017

Website: https://comparable.limsi.fr/bucc2017/

*Shared task: Identifying parallel sentences in comparable corpora*

We announce a new shared task for 2017. A well-known bottleneck in
statistical machine translation is the scarcity of parallel resources
for many language pairs and domains. Previous research has shown that
this bottleneck can be reduced by using the parallel portions found
within comparable corpora. Such parallel portions are useful for many
purposes, including automatic terminology extraction and the training
of statistical MT systems.

The aim of the shared task is to quantitatively evaluate competing
methods for extracting parallel sentences from comparable monolingual
corpora, so as to give an overview of the state of the art and to
identify the best-performing approaches.

Shared task sample set release: 6 February, 2017
Shared task training set release: 13 February, 2017
Shared task test set release: 21 April, 2017
Shared task test submission deadline: 28 April, 2017
Shared task camera ready papers: 26 May, 2017

Each submission to the shared task is expected to be accompanied
by a short paper (4 pages plus references). The paper will be accepted
for publication in the workshop proceedings automatically, although
it must still be submitted via Softconf and will go through the
standard peer-review process.

Motivation

In the language engineering and linguistics communities, research on
comparable corpora has two main motivations. In language engineering,
it is chiefly motivated by the need to use comparable corpora as
training data for statistical NLP applications such as statistical
machine translation or cross-lingual retrieval. In linguistics, on the
other hand, comparable corpora are of interest in themselves because
they make possible intra-linguistic discoveries and comparisons. It is
generally accepted in both communities that comparable corpora are
documents in one or several languages that are comparable in content
and form to various degrees and along various dimensions. We believe
that the linguistic definitions and observations related to comparable
corpora can improve methods for mining such corpora for applications
of statistical NLP. It is therefore of great interest to bring
together builders and users of such corpora.

Topics

We solicit contributions including but not limited to the following
topics.

Building Comparable Corpora:
• Human translations
• Automatic and semi-automatic methods
• Methods to mine parallel and non-parallel corpora from the Web
• Tools and criteria to evaluate the comparability of corpora
• Parallel vs non-parallel corpora, monolingual corpora
• Rare and minority languages, across language families
• Multi-media/multi-modal comparable corpora

Applications of comparable corpora:
• Human translations
• Language learning
• Cross-language information retrieval & document categorization
• Bilingual projections
• Machine translation
• Writing assistance
• Machine learning techniques using comparable corpora

Mining from Comparable Corpora:
• Induction of morphological, grammatical, and translation rules
  from comparable corpora
• Extraction of parallel segments or paraphrases from comparable
  corpora
• Extraction of bilingual and multilingual translations of single
  words and multi-word expressions, proper names, and named
  entities from comparable corpora
• Induction of multilingual word classes from comparable corpora
• Cross-language distributional semantics

Submission Information

See BUCC 2017 website: http://comparable.limsi.fr/bucc2017/

Workshop organisers:

Serge Sharoff (University of Leeds, UK), Chair
Pierre Zweigenbaum (LIMSI-CNRS, Orsay, France), Shared task organiser
Reinhard Rapp (University of Mainz, Germany)