Final CFP: The 6th Workshop on Asian Translation

Event Notification Type: 
Call for Papers
Abbreviated Title: 
WAT2019
Location: 
Asia World Expo
Country: 
China
City: 
Hong Kong
Contact: 
Toshiaki Nakazawa
Submission Deadline: 
Monday, 19 August 2019

Final call for papers
---------------------------------------------------------------------------
WAT2019
(The 6th Workshop on Asian Translation)
in conjunction with EMNLP-IJCNLP2019
http://lotus.kuee.kyoto-u.ac.jp/WAT/
November 3 or 4, 2019, Hong Kong, China

Following the success of the previous WAT workshops (WAT2014 --
WAT2018), WAT2019 will bring together machine translation researchers
and users to try, evaluate, share and discuss brand-new ideas about
machine translation. For the 6th WAT, we will include the following
new translation tasks:

* Japanese <--> English timely disclosure documents task
* Khmer <--> English Mixed-domain task
* Tamil <--> English Mixed-domain task
* Russian <--> Japanese News Commentary task
* English --> Hindi multimodal task

In addition to the shared tasks, the workshop will also feature
scientific papers on topics related to machine translation,
especially for Asian languages. Topics of interest include, but are
not limited to:

- analysis of the automatic/human evaluation results in the past WAT workshops
- word-/phrase-/syntax-/semantics-/rule-based, neural and hybrid machine translation
- Asian language processing
- incorporating linguistic information into machine translation
- decoding algorithms
- system combination
- error analysis
- manual and automatic machine translation evaluation
- machine translation applications
- quality estimation
- domain adaptation
- machine translation for low resource languages
- language resources

************************* IMPORTANT NOTICE *************************
Participants of the previous workshop are also required to sign up to
WAT2019
********************************************************************

TRANSLATION TASKS
-----------------

The task is to improve the text translation quality for scientific
papers and patent documents. Participants choose any of the subtasks
in which they would like to participate and translate the test data
using their machine translation systems. The WAT organizers will
evaluate the results submitted using automatic evaluation and human
evaluation. We will also provide a baseline machine translation system.

Tasks:
Scientific Paper: [Asian Scientific Paper Excerpt Corpus (ASPEC)]
  English/Chinese <--> Japanese
Patent: [Japan Patent Office Patent Corpus 2.0 (JPC2)]
  English/Chinese/Korean <--> Japanese
Timely Disclosure: [Timely Disclosure Documents Corpus] NEW!!
  Japanese <--> English
Newswire: [JIJI Corpus]
  Japanese <--> English
News Commentary: NEW!!
  Japanese <--> Russian (Japanese <--> English and English <--> Russian included)
Mixed domain:
  Myanmar <--> English [UCSY and ALT corpora]
  Khmer <--> English [ECCC and ALT corpora] NEW!!
Indic:
  Hindi <--> English [IIT Bombay (IITB) corpus]
  Tamil <--> English [UFAL (EnTam) corpus] NEW!!
  English --> Hindi Multimodal NEW!!

Dataset:

* Scientific paper

WAT uses ASPEC for the dataset including training, development,
development test and test data. Participants of the scientific papers
subtask must get a copy of ASPEC by themselves. ASPEC consists of
approximately 3 million Japanese-English parallel sentences from paper
abstracts (ASPEC-JE) and approximately 0.7 million Japanese-Chinese
paper excerpts (ASPEC-JC).

* Patent

WAT uses the JPO Patent Corpus, which was constructed by the Japan
Patent Office (JPO). This corpus consists of 1 million English-Japanese
parallel sentences, 1 million Chinese-Japanese parallel sentences, and
1 million Korean-Japanese parallel sentences from patent descriptions
in four categories. Participants of patent tasks are required to get
it from the WAT2019 site of the JPO Patent Corpus.

- English/Chinese/Korean <--> Japanese:
These tasks evaluate the performance of a translation model in the
same way as the other translation tasks. Unlike the previous tasks at
WAT2015, WAT2016 and WAT2017, the new test sets of these tasks consist
of (a) patent documents published between 2011 and 2013, which were
used in the past years' WAT, and (b) ones published between 2016 and
2017 for each language pair. We will also evaluate performance on
section (a) so as to compare with systems submitted in the past years'
WAT.

- Chinese --> Japanese expression pattern task:
This task evaluates the performance of a translation model for each
predefined category of expression patterns, which corresponds to
title of invention (TIT), abstract (ABS), scope of claim (CLM) or
description (DES). The test set of this task consists of sentences,
each of which is annotated with a corresponding category of expression
patterns.

* Timely Disclosure

WAT uses the Timely Disclosure Documents Corpus, which was constructed
by Japan Exchange Group (JPX). This corpus consists of 1.4M
Japanese-English parallel sentences from timely disclosure documents.
Participants of Timely Disclosure tasks are required to get
it from the WAT2019 site of the Timely Disclosure Documents Corpus.

* Newswire

WAT uses the JIJI Corpus, which was constructed by Jiji Press Ltd. in
collaboration with the National Institute of Information and
Communications Technology (NICT). This corpus consists of 200K
Japanese-English parallel sentences from Jiji Press news in various
categories. Participants of the newswire subtask are required to get
it from the WAT2019 site of the JIJI Corpus.

* News Commentary

WAT uses a manually aligned and cleaned Japanese <--> Russian corpus
from the News Commentary domain to study extremely low resource
situations for distant language pairs. The parallel corpus contains
around 12,000 lines. In addition, we will provide Japanese <-->
English and Russian <--> English in-domain and out-of-domain corpora
along with monolingual corpora. The corpus will be available after
18th May, 2019.

* Mixed domain

- Myanmar (Burmese) <--> English
WAT uses UCSY Corpus and ALT Corpus. The UCSY corpus and a portion of
the ALT corpus are used as training data, which are around 220,000
lines of sentences and phrases. The development and test data are
from the ALT corpus.

- Khmer <--> English
WAT uses ECCC Corpus and ALT Corpus. The ECCC corpus and a portion of
the ALT corpus are used as training data, which are around 120,000
lines of sentences and phrases. The development and test data are
from the ALT corpus.

* Indic

- Hindi <--> English
WAT uses the IITB Corpus for training, development, development test
and test data. The training corpus is mixed domain and contains around
1 million lines of sentences and phrases. In order to access the
corpus, participants should sign the following agreement, scan it and
send it to the address mentioned in it. The development and test sets
are from the News domain and are exactly the same as the ones in
WMT 2014.

-- Vanilla subtask
Develop Hindi-English and English-Hindi MT systems using only the
provided IITB English-Hindi Parallel and Monolingual corpora.

-- Multilingual NMT subtask
Use multilingual NMT with an additional XX-En corpus to improve the
Hi-En translation task. Multilingual NMT can be done using Transfer
Learning (Zoph et al. 2016) or using Joint Learning (Johnson et
al. 2016). The choice of the additional corpus is up to the
participant. One possible choice is the Arabic-English UN corpus of
approximately 11 million lines.

- Tamil <--> English
WAT will use the EnTam corpus collected by researchers at
UFAL. The training data contains around 160,000 lines of parallel
corpora. The data belongs to three domains: Cinema, News and Bible.

- English --> Hindi Multimodal (Visual Genome)
For the first time, WAT will organize a multimodal English -->
Hindi translation task where the input will be text and an image and
the output will be a caption (text). The training set contains around
30,000 segments. Additional details will be given on the task
website.

EVALUATION
----------

Automatic evaluation:
We are providing an automatic evaluation server. It is free for
everyone, but you need to create an account for evaluation. Just
viewing the list of evaluation results does not require an account.

Sign-up: http://lotus.kuee.kyoto-u.ac.jp/WAT/WAT2019/index.html
Eval. result: http://lotus.kuee.kyoto-u.ac.jp/WAT/evaluation/index.html

Human evaluation:
Both crowdsourcing evaluation and JPO adequacy evaluation will be
carried out for selected subtasks and selected submitted systems (the
details will be announced later).

INVITED TALK
------------

TBA

ORGANIZERS
----------

Toshiaki Nakazawa, The University of Tokyo, Japan
Chenchen Ding, National Institute of Information and Communications Technology (NICT), Japan
Raj Dabre, National Institute of Information and Communications Technology (NICT), Japan
Anoop Kunchukuttan, Microsoft AI and Research, India
Win Pa Pa, University of Computer Studies, Yangon (UCSY), Myanmar
Nobushige Doi, Japan Exchange Group (JPX), Japan
Yusuke Oda, Google, Japan
Ondřej Bojar, Charles University, Prague, Czech Republic
Shantipriya Parida, Idiap Research Institute, Martigny, Switzerland
Isao Goto, Japan Broadcasting Corporation (NHK), Japan
Hideya Mino, Japan Broadcasting Corporation (NHK), Japan
Hiroshi Manabe, National Institute of Information and Communications Technology (NICT), Japan
Katsuhito Sudoh, Nara Institute of Science and Technology (NAIST), Japan
Sadao Kurohashi, Kyoto University, Japan
Pushpak Bhattacharyya, Indian Institute of Technology Patna (IITP), India

CONTACT
-------

wat-organizer@googlegroups.com