Overview of the 5th Workshop on Asian Translation

This paper presents the results of the shared tasks from the 5th workshop on Asian translation (WAT2018), including Ja↔En and Ja↔Zh scientific paper translation subtasks, Ja↔En, Ja↔Ko and Ja↔Zh patent translation subtasks, Hi↔En and My↔En mixed domain subtasks, English↔Indic languages multilingual subtasks, and the En↔Ja recipe translation subtask. For WAT2018, 25 teams participated in the shared tasks. We also received 10 research paper submissions, out of which 6 were accepted. About 400 translation results were submitted to the automatic evaluation server, and selected submissions were manually evaluated.


Introduction
The Workshop on Asian Translation (WAT) is an open evaluation campaign focusing on Asian languages. Following the success of the previous workshops WAT2014-WAT2017 (Nakazawa et al., 2014; Nakazawa et al., 2015; Nakazawa et al., 2016; Nakazawa et al., 2017), WAT2018 brings together machine translation researchers and users to try, evaluate, share and discuss brand-new ideas about machine translation. We have been working toward the practical use of machine translation among all Asian countries.
For the 5th WAT, we adopted new translation subtasks with a Myanmar↔English mixed domain corpus 1 and a Bengali/Hindi/Malayalam/Tamil/Telugu/Urdu/Sinhalese↔English OpenSubtitles corpus 2 in addition to the subtasks at WAT2017.
WAT is a unique workshop on Asian language translation with the following characteristics:
• Open innovation platform
Due to the fixed and open test data, we can repeatedly evaluate translation systems on the same dataset over the years. WAT receives submissions at any time; i.e., there is no submission deadline for translation results with respect to the automatic evaluation of translation quality.

Dataset

ASPEC
ASPEC was constructed by the Japan Science and Technology Agency (JST) in collaboration with the National Institute of Information and Communications Technology (NICT). The corpus consists of a Japanese-English scientific paper abstract corpus (ASPEC-JE), which is used for the ja↔en subtasks, and a Japanese-Chinese scientific paper excerpt corpus (ASPEC-JC), which is used for the ja↔zh subtasks. The statistics for each corpus are shown in Table 1.

ASPEC-JE
The training data for ASPEC-JE was constructed by NICT from approximately two million Japanese-English scientific paper abstracts owned by JST. The data is a comparable corpus, and sentence correspondences were found automatically using the method from (Utiyama and Isahara, 2007). Each sentence pair is accompanied by a similarity score calculated by this method and a field ID that indicates the scientific field. The correspondence between field IDs and field names, along with their frequencies and occurrence ratios in the training data, is described in the README file of ASPEC-JE.
The development, development-test and test data were extracted from parallel sentences from the Japanese-English paper abstracts that exclude the sentences in the training data. Each dataset consists of 400 documents and contains sentences from each field at the same rate. The document alignment was conducted automatically, and only documents with a 1-to-1 alignment are included. It is therefore possible to restore the original documents. The format is the same as that of the training data, except that there is no similarity score.

Table 2: Statistics for JPC
Lang   Train      Dev    DevTest  Test-N
zh-ja  1,000,000  2,000  2,000    5,204
ko-ja  1,000,000  2,000  2,000    5,230
en-ja  1,000,000  2,000  2,000    5,668

Lang   Test-N1  Test-N2  Test-N3  Test-EP
zh-ja  2,000    3,000    204      1,151
ko-ja  2,000    3,000    230      -
en-ja  2,000    3,000    668      -

ASPEC-JC
ASPEC-JC is a parallel corpus consisting of Japanese scientific papers, which come from the literature database and electronic journal site J-STAGE by JST, and their translations into Chinese, produced with permission from the necessary academic associations. Abstracts and paragraph units were selected from the body text so as to achieve the highest overall vocabulary coverage.
The development, development-test and test data were extracted at random from documents containing single paragraphs across the entire corpus. Each set contains 400 paragraphs (documents). No documents are shared across the training, development, development-test and test sets.

JPC
The JPO Patent Corpus (JPC) for the patent tasks was constructed by the Japan Patent Office (JPO) in collaboration with NICT. The corpus consists of Chinese-Japanese, Korean-Japanese and English-Japanese patent descriptions whose International Patent Classification (IPC) sections are chemistry, electricity, mechanical engineering, and physics.
At WAT2018, the patent task has two subtasks: the normal subtask and the expression pattern subtask. Both subtasks use common training, development and development-test data for each language pair. The normal subtask for the three language pairs uses four test sets with different characteristics:
• test-N: union of the following three sets;
• test-N1: patent documents from patent families published between 2011 and 2013;

Table 3:
Lang   Train    Dev    DevTest  Test
en-ja  200,000  2,000  2,000    2,000

The sentence pairs in each dataset are identified in the same manner as for ASPEC, using the method from (Utiyama and Isahara, 2007).

IITB Corpus
The IIT Bombay English-Hindi Corpus contains an English-Hindi parallel corpus as well as monolingual Hindi data collected from a variety of sources and corpora. The corpus has been developed at the Center for Indian Language Technology, IIT Bombay, over the years. It is used for the hi↔en mixed domain tasks. The statistics for the corpus are shown in Table 4.

Recipe Corpus
Recipe Corpus was constructed by Cookpad Inc. Each recipe consists of a title, ingredients, steps, a ... The statistics for each corpus are described in Table 5.

Table 4:
Lang   Train      Dev    Test   Mono
hi-en  1,492,827  520    2,507  -
hi-ja  152,692    1,566  2,000  -
hi     -          -      -      45,075,279

ALT and UCSY Corpus
The parallel data for Myanmar-English translation tasks at WAT2018 consists of two corpora, the ALT corpus and UCSY corpus.
• The ALT corpus is one part of the Asian Language Treebank (ALT) project (Riza et al., 2016), consisting of twenty thousand Myanmar-English parallel sentences from news articles.
The statistics for the corpora are shown in Table 6.

Indic Languages Corpus
The Indic Languages Corpus covers 8 languages, namely: Bengali, Hindi, Malayalam, Tamil, Telugu, Sinhalese, Urdu and English. The corpus has been collected from OPUS 4 and belongs to the spoken language (OpenSubtitles) domain. This corpus is used for the pilot as well as multilingual English↔Indic Languages sub-tasks. The corpus is a collection of 7 bilingual parallel corpora of varying sizes, one for each Indic language and English. The parallel corpora are also accompanied by monolingual corpora from the same domain. The statistics of the parallel and monolingual corpora are given in Table 7.

Baseline Systems
Human evaluations were conducted as pairwise comparisons between the translation results of a specific baseline system and those of each participant's system; that is, the specific baseline system was the standard for human evaluation. At WAT2018, we adopted a neural machine translation (NMT) system with an attention mechanism as the baseline system, except for the IITB tasks. For the IITB tasks, we used a phrase-based statistical machine translation (SMT) system, the same system as at WAT2017, as the baseline.
The NMT baseline systems consisted of publicly available software, and the procedures for building the systems and for translating with them were published on the WAT web page. 5 We used OpenNMT (Klein et al., 2017) as the implementation of the baseline NMT systems. In addition to the NMT baseline systems, we provide SMT baseline systems for the tasks that started last year or earlier. The baseline systems are shown in Tables 8, 9, and 10.
The SMT baseline systems are described in the previous WAT overview paper (Nakazawa et al., 2017). The commercial RBMT systems and the online translation systems were operated by the organizers; these RBMT and online translation companies did not participate themselves. Because our objective is not to compare commercial RBMT systems or online translation systems from companies that did not themselves participate, the system IDs of these systems are kept anonymous in this paper.

Training Data
We used the following data for training the NMT baseline systems.
• All of the training data for each task were used for training, except for the ASPEC Japanese-English task; for that task, we used only train-1.txt, which consists of one million parallel sentence pairs with high similarity scores.
• All of the development data for each task were used for validation.

Tokenization
We used the following tools for tokenization.
When we built the BPE codes, we merged the source and target sentences and set the -s option to 100,000. We set the vocabulary threshold to 10 when applying BPE with subword-nmt.
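A minimal sketch of this preprocessing, using the subword-nmt Python API rather than the command-line calls (this is not the organizers' exact script; file names are placeholders and the input text is assumed to be tokenized already):

```python
# A minimal sketch (not the organizers' exact scripts) of the BPE
# preprocessing described above, using the subword-nmt Python API.
# File names such as train.src / train.tgt are placeholders.
import codecs
from collections import Counter
from itertools import chain

from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE, read_vocabulary

# 1. Learn joint BPE codes on the merged source and target training data
#    with 100,000 merge operations (the -s option mentioned above).
with codecs.open("train.src", encoding="utf-8") as src, \
     codecs.open("train.tgt", encoding="utf-8") as tgt, \
     codecs.open("bpe.codes", "w", encoding="utf-8") as codes:
    learn_bpe(chain(src, tgt), codes, num_symbols=100000)

# 2. Write a source-side frequency vocabulary so that rare subwords can be
#    filtered with a vocabulary threshold of 10 when BPE is applied.
counter = Counter()
with codecs.open("train.src", encoding="utf-8") as src:
    for line in src:
        counter.update(line.split())
with codecs.open("vocab.src", "w", encoding="utf-8") as vocab_out:
    for word, count in counter.most_common():
        vocab_out.write(f"{word} {count}\n")

# 3. Apply the learned codes to the source side, keeping only subwords
#    that reach the frequency threshold.
with codecs.open("bpe.codes", encoding="utf-8") as codes, \
     codecs.open("vocab.src", encoding="utf-8") as vocab_in:
    bpe = BPE(codes, vocab=read_vocabulary(vocab_in, threshold=10))

with codecs.open("train.src", encoding="utf-8") as src, \
     codecs.open("train.bpe.src", "w", encoding="utf-8") as out:
    for line in src:
        out.write(bpe.process_line(line))
```

Learning the codes on the merged source and target text, as described above, gives both sides a shared subword vocabulary.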

NMT with attention
We used the following OpenNMT configuration for the NMT with attention system; the default values were used for the other system parameters. For the many-to-one, one-to-many, and many-to-many multilingual NMT systems (Johnson et al., 2017), we added <2XX> tags, which indicate the target language (XX is replaced by the language code), to the head of the source language sentences.
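A minimal sketch of the tagging step (the file names and the helper function are hypothetical; the tag format follows the description above):

```python
# A minimal sketch of the target-language tagging described above for the
# multilingual baselines: a <2XX> token is prepended to every source
# sentence, where XX is the target language code (e.g. <2ja>, <2hi>).
def add_target_tag(src_path: str, out_path: str, target_lang: str) -> None:
    tag = f"<2{target_lang}>"
    with open(src_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as out:
        for line in src:
            out.write(f"{tag} {line.strip()}\n")

# e.g. tag English source sentences to be translated into Japanese
add_target_tag("train.bpe.en", "train.tagged.en", "ja")
```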

Procedure for Calculating Automatic Evaluation Score
We evaluated translation results using three metrics: BLEU (Papineni et al., 2002), RIBES (Isozaki et al., 2010) and AMFM (Banchs et al., 2015). BLEU scores were calculated using multi-bleu.perl in the Moses toolkit (Koehn et al., 2007); RIBES scores were calculated using RIBES.py version 1.02.4; 11 AMFM scores were calculated using scripts created by the technical collaborators listed on the WAT2018 web page. 12 All scores for each task were calculated using the corresponding reference translations.
Before the automatic evaluation scores were calculated, the translation results were tokenized or segmented with tokenization/segmentation tools for each language. For Japanese segmentation, we used three different tools: Juman version 7.0 (Kurohashi et al., 1994), KyTea 0.4.6 (Neubig et al., 2011) with the full SVM model, 13 and MeCab 0.996 (Kudo, 2005) with IPA dictionary 2.7.0. 14 For Chinese segmentation, we used two different tools: KyTea 0.4.6 with the full SVM model (MSR model) and the Stanford Word Segmenter (Tseng, 2005).
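As one example of this segmentation step, the following minimal sketch segments Japanese output with the mecab-python3 binding; the organizers ran the MeCab command-line tool with the IPA dictionary, so treat this as an assumed, roughly equivalent setup that requires a MeCab dictionary to be installed locally:

```python
# A minimal sketch of Japanese pre-segmentation before score calculation.
# Assumes the mecab-python3 package and a MeCab dictionary (e.g. IPA)
# are installed; the organizers used the MeCab command-line tool instead.
import MeCab

tagger = MeCab.Tagger("-Owakati")  # output space-separated surface forms

def segment_japanese(line: str) -> str:
    """Return the input sentence split into space-separated tokens."""
    return tagger.parse(line).strip()

# e.g. segment a system output line before handing it to multi-bleu.perl
print(segment_japanese("機械翻訳の評価を行った。"))
```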

Automatic Evaluation System
The automatic evaluation system receives translation results from participants and automatically assigns evaluation scores to the uploaded results. As shown in Figure 1, the system requires participants to provide the following information for each submission:
• Human Evaluation: whether or not they submit the results for human evaluation;
• Publish the results of the evaluation: whether or not they permit the automatic evaluation scores to be published on the WAT2018 web page;
• Task: the task they submit the results for;
• Used Other Resources: whether or not they used additional resources; and
• Method: the type of method, including SMT, RBMT, SMT and RBMT, EBMT, NMT and Other.
Evaluation scores of translation results that participants permit to be published are disclosed via the WAT2018 evaluation web page. 20 Participants can also submit the results for human evaluation using the same web interface. This automatic evaluation system will remain available even after WAT2018. Anybody can register an account for the system by the procedures described in the registration web page.

Human Evaluation
In WAT2018, we conducted two kinds of human evaluations: pairwise evaluation and JPO adequacy evaluation.

Pairwise Evaluation
We conducted pairwise evaluation for the participants' systems submitted for human evaluation. The submitted translations were evaluated by a professional translation company, and Pairwise scores were given to the submissions by comparing them with the baseline translations (described in Section 3).

Sentence Selection and Evaluation
For the pairwise evaluation, we randomly selected 400 sentences from the test set of each task. We used the same sentences as last year for the continuing subtasks. The baseline and submitted translations were shown to the annotators in random order together with the input source sentence. The annotators were asked to judge which of the translations is better, or whether they are of comparable quality.

Voting
To guarantee the quality of the evaluations, each sentence is evaluated by 5 different annotators and the final decision is made depending on the 5 judgements. We define each judgement j_i (i = 1, ..., 5) as:

j_i = +1 if better than the baseline, -1 if worse than the baseline, 0 if the quality is the same.

The final decision D is defined as follows, using S = Σ j_i:

D = win if S ≥ 2, loss if S ≤ -2, and tie otherwise.

Pairwise Score Calculation
Suppose that W is the number of wins compared to the baseline, L is the number of losses and T is the number of ties. The Pairwise score is then calculated by the following formula:

Pairwise = 100 × (W - L) / (W + L + T)

From this definition, the Pairwise score ranges between -100 and 100.
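As a concrete illustration of the voting rule and the score defined above, the following is a minimal sketch (the helper names are hypothetical):

```python
# A minimal sketch of how the five pairwise judgements are aggregated and
# how the Pairwise score is computed from the decision counts, following
# the definitions above. Helper names are hypothetical.
from typing import List

def decide(judgements: List[int]) -> str:
    """Aggregate five judgements (+1 win, -1 loss, 0 tie) into a decision."""
    s = sum(judgements)
    if s >= 2:
        return "win"
    if s <= -2:
        return "loss"
    return "tie"

def pairwise_score(wins: int, losses: int, ties: int) -> float:
    """Pairwise score in [-100, 100]."""
    return 100.0 * (wins - losses) / (wins + losses + ties)

# e.g. one sentence judged by five annotators, then a task-level score
print(decide([1, 1, 0, -1, 1]))          # "win" (S = 2)
print(pairwise_score(180, 150, 70))      # 7.5
```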
[Submission page instructions from the automatic evaluation server: supported browsers, file format (untokenized, UTF-8 encoded, one sentence per line matching the test file), the human evaluation options, and the fields published with each submission.]

Confidence Interval Estimation
There are several ways to estimate a confidence interval. We chose bootstrap resampling (Koehn, 2004) to estimate the 95% confidence interval. The procedure is as follows:
1. Randomly select 300 sentences from the 400 human evaluation sentences, and calculate the Pairwise score of the selected sentences.
2. Iterate the previous step 1,000 times to obtain 1,000 Pairwise scores.
3. Sort the 1,000 scores and estimate the 95% confidence interval by discarding the top 25 scores and the bottom 25 scores.
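A minimal sketch of this procedure, assuming the per-sentence decisions ("win" / "loss" / "tie") are available as a Python list (variable and function names are hypothetical):

```python
# A minimal sketch of the bootstrap confidence interval estimation
# described above. `decisions` stands for the 400 per-sentence pairwise
# decisions ("win" / "loss" / "tie") of one submitted system.
import random
from typing import List, Tuple

def pairwise_score(decisions: List[str]) -> float:
    wins = decisions.count("win")
    losses = decisions.count("loss")
    ties = decisions.count("tie")
    return 100.0 * (wins - losses) / (wins + losses + ties)

def bootstrap_interval(decisions: List[str],
                       sample_size: int = 300,
                       iterations: int = 1000) -> Tuple[float, float]:
    """Estimate the 95% confidence interval of the Pairwise score."""
    scores = []
    for _ in range(iterations):
        # step 1: randomly select 300 of the 400 evaluated sentences
        sample = random.sample(decisions, sample_size)
        scores.append(pairwise_score(sample))
    scores.sort()
    # step 3: discard the top 25 and bottom 25 of the 1,000 scores
    return scores[25], scores[-26]
```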

JPO Adequacy Evaluation
We conducted JPO adequacy evaluation for the top two or three participants' systems of the pairwise evaluation for each subtask. 22 The evaluation was carried out by translation experts based on the JPO adequacy evaluation criterion, which was originally defined by the JPO to assess the quality of translated patent documents.

Sentence Selection and Evaluation
For the JPO adequacy evaluation, 200 test sentences were randomly selected from the 400 test sentences used for the pairwise evaluation. For each test sentence, the input source sentence, the translation from the participant's system, and the reference translation were shown to the annotators. To guarantee the quality of the evaluation, each sentence was evaluated by two annotators. Note that the selected sentences are the same as those used in the previous workshops, except for the new subtasks at WAT2018. Table 11 shows the JPO adequacy criterion, with grades from 5 to 1. The evaluation is performed subjectively. "Important information" represents the technical factors and their relationships. The degree of importance of each element is also considered in the evaluation. The percentages in each grade are rough indications.
22 The number of systems varies depending on the subtasks.

Evaluation Results
In this section, the evaluation results for WAT2018 are reported from several perspectives. Some of the results for both automatic and human evaluations are also accessible on the WAT2018 website. 24 The weights for the weighted κ (Cohen, 1968) used in Table 14 are defined as |Evaluation1 − Evaluation2|/4.
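A minimal sketch of how such a weighted κ can be computed; scikit-learn's linear weighting (|i - j|) is proportional to the |Evaluation1 − Evaluation2|/4 weights for the five adequacy grades, and κ is unchanged by that scaling (the grade lists below are made up):

```python
# A minimal sketch of a weighted kappa computation consistent with the
# weighting above. scikit-learn's "linear" weights are |i - j|, which is
# proportional to |Evaluation1 - Evaluation2|/4 for five adequacy grades,
# and kappa is invariant to that scaling. The grades are hypothetical.
from sklearn.metrics import cohen_kappa_score

annotator1 = [5, 4, 3, 5, 2, 1, 4, 3]  # hypothetical adequacy grades
annotator2 = [5, 3, 3, 4, 2, 2, 4, 3]

kappa = cohen_kappa_score(annotator1, annotator2, weights="linear")
print(f"weighted kappa: {kappa:.3f}")
```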

Statistical Significance Testing of Pairwise Evaluation between Submissions
Tables 15 and 16 show the results of statistical significance testing for the ASPEC subtasks, Table 17 shows those for the IITB subtasks, Table 18 shows those for the ALT subtasks, and Tables 19 and 20 show those for the INDIC subtasks. ≫, ≫ and > mean that the system in the row is better than the system in the column at a significance level of p < 0.01, 0.05 and 0.1, respectively. Testing is also done by bootstrap resampling, as follows:
1. Randomly select 300 sentences from the 400 pairwise evaluation sentences, and calculate the Pairwise scores of the selected sentences for both systems.
2. Iterate the previous step 1,000 times and count the number of wins (W), losses (L) and ties (T).

Inter-annotator Agreement
To assess the reliability of agreement between the workers, we calculated Fleiss' κ (Fleiss and others, 1971) values. The results are shown in Table 21. We can see that the κ values are larger for X → J translations than for J → X translations. This may be because the majority of the workers for these language pairs are Japanese, and evaluating one's mother tongue is in general much easier than evaluating other languages. The κ values for the Hindi subtasks are relatively high. This might be because the overall translation quality for the Hindi subtasks is low, so the evaluators can easily distinguish better translations from worse ones.
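A minimal sketch of the Fleiss' κ calculation, assuming the judgements are available as per-sentence category counts (the counts below are made up):

```python
# A minimal sketch of Fleiss' kappa for inter-annotator agreement.
# ratings[s][c] is the number of annotators who assigned category c to
# sentence s; for the pairwise evaluation the categories would be
# win / loss / tie and each row sums to 5.
from typing import List

def fleiss_kappa(ratings: List[List[int]]) -> float:
    n_items = len(ratings)
    n_raters = sum(ratings[0])          # assumed constant across items
    n_categories = len(ratings[0])
    total = n_items * n_raters

    # proportion of all assignments that went to each category
    p_j = [sum(row[j] for row in ratings) / total for j in range(n_categories)]
    # per-item observed agreement
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in ratings]

    p_bar = sum(p_i) / n_items              # mean observed agreement
    p_e = sum(p * p for p in p_j)           # chance agreement
    return (p_bar - p_e) / (1 - p_e)

# hypothetical counts for three sentences rated by 5 annotators
print(fleiss_kappa([[4, 1, 0], [2, 2, 1], [5, 0, 0]]))
```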

Conclusion and Future Perspective
This paper summarizes the shared tasks of WAT2018. We had 17 participants worldwide and collected a large number of useful submissions, which help improve current machine translation systems through analysis of the submissions and identification of the remaining issues.
For the next WAT workshop, we plan to conduct document-level evaluation using a new dataset with context for some translation subtasks, and we would like to consider how to realize context-aware evaluation in WAT. We are also planning to conduct extrinsic evaluation of the translations.

Appendix A Submissions
Tables 23 to 37 summarize the translation results submitted for WAT2018 human evaluation. The Type, RSRC, Pair, and Adeq columns indicate the type of method, the use of other resources, the pairwise evaluation score, and the JPO adequacy evaluation score, respectively.
The tables also include results from the organizers' baselines, which are listed in Table 10. For the ALT tasks, we also evaluated the outputs of the Online-A system and a post-processed version of them in which the western comma (,) is replaced with the Myanmar native comma (U+104A). We applied this post-processing because Myanmar native punctuation marks are used consistently in the WAT2018 dataset.