Overview of the 6th Workshop on Asian Translation

This paper presents the results of the shared tasks from the 6th Workshop on Asian Translation (WAT2019), including Ja↔En and Ja↔Zh scientific paper translation subtasks, Zh↔Ja, Ja↔Ko and Ja↔En patent translation subtasks, Hi↔En, My↔En, Km↔En and Ta↔En mixed domain subtasks, and a Ru↔Ja news commentary translation task. For WAT2019, 25 teams participated in the shared tasks. We also received 10 research paper submissions, out of which 6¹ were accepted. About 400 translation results were submitted to the automatic evaluation server, and selected submissions were manually evaluated.


Introduction
The Workshop on Asian Translation (WAT) is an open evaluation campaign focusing on Asian languages. Following the success of the previous workshops WAT2014-WAT2018 (Nakazawa et al., 2014, 2015, 2016, 2017, 2018), WAT2019 brings together machine translation researchers and users to try, evaluate, share and discuss brand-new ideas for machine translation. We have been working toward practical use of machine translation among all Asian countries.
For the 6th WAT, we adopted new translation subtasks with Khmer↔English and Tamil↔English mixed domain corpora, a Russian↔Japanese news commentary corpus and an English→Hindi multi-modal corpus, in addition to most of the subtasks of WAT2018.
¹ One paper was withdrawn post acceptance and hence only 6 papers will be in the proceedings.
WAT is a unique workshop on Asian language translation with the following characteristics:
• Open innovation platform: Due to the fixed and open test data, we can repeatedly evaluate translation systems on the same dataset over years. WAT receives submissions at any time; i.e., there is no submission deadline of translation results w.r.t. automatic evaluation of translation quality.
• Domain and language pairs: WAT is the world's first workshop that targets scientific paper domain, and Chinese↔Japanese and Korean↔Japanese language pairs. In the future, we will add more Asian languages such as Vietnamese, Thai and so on.
• Evaluation method: Evaluation is done both automatically and manually. Firstly, all submitted translation results are automatically evaluated using three metrics: BLEU, RIBES and AMFM. Among them, selected translation results are assessed by two kinds of human evaluation: pairwise evaluation and JPO adequacy evaluation.

Datasets
ASPEC
ASPEC was constructed by the Japan Science and Technology Agency (JST) in collaboration with the National Institute of Information and Communications Technology (NICT). The corpus consists of a Japanese-English scientific paper abstract corpus (ASPEC-JE), which is used for ja↔en subtasks, and a Japanese-Chinese scientific paper excerpt corpus (ASPEC-JC), which is used for ja↔zh subtasks. The statistics for each corpus are shown in Table 1.

ASPEC-JE
The training data for ASPEC-JE was constructed by NICT from approximately two million Japanese-English scientific paper abstracts owned by JST. The data is a comparable corpus and sentence correspondences are found automatically using the method from Utiyama and Isahara (2007). Each sentence pair is accompanied by a similarity score calculated by the method and a field ID that indicates a scientific field. The correspondence between field IDs and field names, along with the frequency and occurrence ratios for the training data, are described in the README file of ASPEC-JE. The development, development-test and test data were extracted from parallel sentences from the Japanese-English paper abstracts that exclude the sentences in the training data. Each dataset consists of 400 documents and contains sentences in each field at the same rate. The document alignment was conducted automatically and only documents with a 1-to-1 alignment are included. It is therefore possible to restore the original documents. The format is the same as the training data except that there is no similarity score.

Lang    Train      Dev    DevTest  Test-N
zh-ja   1,000,000  2,000  2,000    5,204
ko-ja   1,000,000  2,000  2,000    5,230
en-ja   1,000,000  2,000  2,000    5,668

Lang    Test-N1  Test-N2  Test-N3  Test-EP
zh-ja   2,000    3,000    204      1,151
ko-ja   2,000    3,000    230      -
en-ja   2,000    3,000    668      -

Table 2: Statistics for JPC

ASPEC-JC
ASPEC-JC is a parallel corpus consisting of Japanese scientific papers, which come from the literature database and electronic journal site J-STAGE by JST, and their translation to Chinese with permission from the necessary academic associations. Abstracts and paragraph units are selected from the body text so as to contain the highest overall vocabulary coverage.
The development, development-test and test data are extracted at random from documents containing single paragraphs across the entire corpus. Each set contains 400 paragraphs (documents). There are no documents sharing the same data across the training, development, development-test and test sets.

JPC
JPO Patent Corpus (JPC) for the patent tasks was constructed by the Japan Patent Office (JPO) in collaboration with NICT. The corpus consists of Chinese-Japanese, Korean-Japanese and English-Japanese patent descriptions whose International Patent Classification (IPC) sections are chemistry, electricity, mechanical engineering, and physics. At WAT2019, the patent tasks have two subtasks: a normal subtask and an expression pattern subtask. Both subtasks use common training, development and development-test data for each language pair. The normal subtask for the three language pairs uses four test sets with different characteristics:
• test-N: union of the following three sets;
• test-N1: patent documents from patent families published between 2011 and 2013;
[...]

TSE is one of the largest capital markets in the world that has over 3,600 companies listed as of the end of 2018. Companies are required to disclose material information including financial statements, corporate actions, and corporate governance policies to the public in a timely manner. These timely disclosure documents form an important basis for investment decisions, containing important figures (e.g., sales, profits, significant dates) and proper nouns (e.g., names of persons, places, companies, business and product). Since such information is critical for investors, mistranslations should be avoided and translations should be of a high quality.

[Table fragment: corpus statistics by disclosure period. Columns: Disclosure Period; Train; Dev (Texts, Items); DevTest (Texts, Items); Test (Texts, Items). Row "2016-01-01 to 2017-12-31": Train 1,089,346 (614,817), all other cells "-". Row "2018-01-01 to [...]": 314,[...]]
The corpus consists of Japanese-English sentence pairs, document hashes, and sentence hashes. A document hash is a hash of the Document ID, which is a unique identifier of the source document. A sentence hash is a hash of the Document ID and the Sentence ID, which is a unique identifier of the sentence in each source document.
The corpus is partitioned into training, development, development-test, and test data. The training data is split into two (2) sets of data from different periods. The first data set was created based on documents disclosed from January 1, 2016 to December 31, 2017, and the second data set was based on documents from January 1, 2018 to June 30, 2018. The development, development-test, and test data sets were extracted from timely disclosure documents disclosed from January 1, 2018 to June 30, 2018, excluding documents that were used to create the training data. The documents for the period were randomly selected, and the sentences were extracted from each randomly selected, discrete document set so that the sources extracted are not biased. Therefore, the set of source documents for training, development, development-test and [...]

Lang    Train    Dev    DevTest  Test
en-ja   200,000  2,000  2,000    2,000

[...] Table 4. The sentence pairs in each data set are identified in the same manner as that for ASPEC, using the method from Utiyama and Isahara (2007).

IITB Corpus
The IIT Bombay English-Hindi Corpus contains an English-Hindi parallel corpus as well as a monolingual Hindi corpus collected from a variety of sources and corpora. This corpus has been developed at the Center for Indian Language Technology, IIT Bombay over the years. The corpus is used for the mixed domain tasks hi↔en. The statistics for the corpus are shown in Table 5.

ALT and UCSY Corpus
The parallel data for Myanmar-English translation tasks at WAT2019 consists of two corpora, the ALT corpus and the UCSY corpus. The ALT corpus has been manually segmented into words (Ding et al., 2019), and the UCSY corpus is unsegmented. A script to tokenize the Myanmar data into writing units is released with the data. The automatic evaluation of Myanmar translation results is based on the tokenized writing units, regardless of the segmented words in the ALT data. However, participants can make use of the segmentation in the ALT data in their own manner.

Corpus  Train    Dev    Test
ALT     18,088   1,000  1,018
UCSY    204,539  -      -
All     222,627  1,000  1,018

The detailed composition of training, development, and test data of the Myanmar-English translation tasks is listed in Table 6. Notice that both of the corpora have been modified from the data used in WAT2018.

ALT and ECCC Corpus
The parallel data for Khmer-English translation tasks at WAT2019 consists of two corpora, the ALT corpus and ECCC corpus.
• The ALT corpus is one part of the Asian Language Treebank (ALT) project (Riza et al., 2016), consisting of twenty thousand Khmer-English parallel sentences from news articles.
The ALT corpus has been manually segmented into words, and the ECCC corpus is unsegmented. A script to tokenize the Khmer data into writing units is released with the data. The automatic evaluation of Khmer translation results is based on the tokenized writing units, regardless of the segmented words in the ALT data. However, participants can make use of the segmentation in the ALT data in their own manner.
The detailed composition of training, development, and test data of the Khmer-English translation tasks is listed in Table 7.

Multi-Modal Task Corpus
For English→Hindi multi-modal translation task we asked the participants to use the Hindi Visual Genome corpus (HVG, Parida et al., 2019a,b). The statistics of the corpus are given in Table 8. One "item" in the original HVG consists of an image with a rectangular region highlighting a part of the image, the original English caption of this region and the Hindi reference translation. Depending on the track (see 2.8.1 below), some of these item components are available as the source and some serve as the reference or play the role of a competing candidate solution.
HVG Training, D-Test and E-Test sections were accessible to the participants in advance. The participants were explicitly instructed not to consult E-Test in any way but, strictly speaking, they could have used the reference translation (which would mean cheating from the evaluation point of view). C-Test was provided only for the task itself: the source side was distributed to task participants and the target side was published only after the output submission deadline. Note that the original Visual Genome suffers from a considerable level of noise. Some observed English grammar errors are illustrated in Figure 1. We also took the chance and used our manual evaluation for validating the quality of the captions given the picture, see 8.4.1 below.
The multi-modal task includes three tracks as illustrated in Figure 1:
1. Text-Only Translation (labeled [...] in WAT official tables): The participants are asked to translate short English captions (text) into Hindi. No visual information can be used. On the other hand, additional text resources are permitted (but they need to be specified in the corresponding system description paper).
2. Hindi Captioning (labeled "HI"): The participants are asked to generate captions in Hindi for the given rectangular region in an input image.
3. Multi-Modal Translation (labeled "MM"): Given an image, a rectangular region in it and an English caption for the rectangular region, the participants are asked to translate the English text into Hindi. Both textual and visual information can be used.

[...] translation quality as well but this is beyond the scope of this year's sub-task. Refer to Table 10 for the statistics of the in-domain parallel corpora. In addition we encouraged the participants to use out-of-domain parallel corpora from various sources such as KFTT, JESC, TED, ASPEC, UN, Yandex and the Russian↔English news-commentary corpus.

Baseline Systems
Human evaluations of most WAT tasks were conducted as pairwise comparisons between the translation results for a specific baseline system and the translation results for each participant's system. That is, the specific baseline system was the standard for human evaluation. At WAT 2019, we adopted a neural machine translation (NMT) system with an attention mechanism as the baseline system.
The NMT baseline systems consisted of publicly available software, and the procedures for building the systems and for translating using the systems were published on the WAT web page. We also have SMT baseline systems for the tasks that started at or before WAT 2017. The baseline systems are shown in Tables 11, 12, and 13. SMT baseline systems are described in the WAT 2017 overview paper (Nakazawa et al., 2017). The commercial RBMT systems and the online translation systems were operated by the organizers. We note that these RBMT companies and online translation companies did not submit themselves. Because our objective is not to compare commercial RBMT systems or online translation systems from companies that did not themselves participate, the system IDs of these systems are anonymous in this paper.

Training Data
We used the following data for training the NMT baseline systems.
• All of the training data for each task were used for training except for the ASPEC Japanese-English task. For the ASPEC Japanese-English task, we only used train-1.txt, which consists of one million parallel sentence pairs with high similarity scores.
• All of the development data for each task was used for validation.

Tokenization
We used the following tools for tokenization.
When we built BPE-codes, we merged source and target sentences and we used 100,000 for the -s option. We used 10 for vocabulary-threshold when subword-nmt applied BPE.

For EnTam, News Commentary
• The Moses toolkit for English and Russian only for the News Commentary data.
• The EnTam corpus is not tokenized by any external toolkits.
• Both corpora are further processed by tensor2tensor's internal pre/postprocessing which includes sub-word segmentation.

For Multi-Modal Task
• Hindi Visual Genome comes untokenized and we did not use or recommend any specific external tokenizer.
• The standard OpenNMT-py sub-word segmentation was used for pre/postprocessing for the baseline system and each participant used what they wanted.

Baseline NMT Methods
We used the following NMT with attention for most of the tasks. We used Transformer (Vaswani et al., 2017) (Tensor2Tensor) for the News Commentary and English↔Tamil tasks and Transformer (OpenNMT-py) for the Multimodal task.

NMT with Attention
We used OpenNMT (Klein et al., 2017) as the implementation of the baseline NMT systems of NMT with attention (System ID: NMT). We used the following OpenNMT configuration.
• encoder_type = brnn
The default values were used for the other system parameters.

Transformer (Tensor2Tensor)
For the News Commentary and English↔Tamil tasks, we used tensor2tensor's implementation of the Transformer (Vaswani et al., 2017) and used the default hyperparameter settings corresponding to the "base" model for all baseline models. The baseline for the News Commentary task is a multilingual model as described in Imankulova et al. (2019), which is trained using only the in-domain parallel corpora. We use the token trick proposed by Johnson et al. (2017) to train the multilingual model. As for the English↔Tamil task, we train separate baseline models for each translation direction with 32,000 separate sub-word vocabularies.
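The token trick mentioned above can be illustrated with a minimal Python sketch: an artificial token naming the desired target language is prepended to every source sentence, so that a single model can serve several translation directions. The token format `<2xx>` and the function names are assumptions made for illustration only and are not taken from the baseline scripts.

```python
from typing import Dict, List, Tuple


def add_target_token(source_sentence: str, target_lang: str) -> str:
    """Prepend an artificial token that names the desired target language.

    The "<2xx>" token format is a hypothetical choice for this sketch.
    """
    return f"<2{target_lang}> {source_sentence}"


def build_multilingual_corpus(
    corpora: Dict[Tuple[str, str], List[Tuple[str, str]]]
) -> List[Tuple[str, str]]:
    """Merge several bilingual corpora into one tagged training set.

    `corpora` maps (source_lang, target_lang) to a list of
    (source_sentence, target_sentence) pairs.
    """
    examples = []
    for (_, target_lang), pairs in corpora.items():
        for source, target in pairs:
            examples.append((add_target_token(source, target_lang), target))
    return examples
```

At test time the same token would be prepended to each input sentence to select the output language.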

Transformer (OpenNMT-py)
For the Multimodal task, we used the Transformer model (Vaswani et al., 2018) as implemented in OpenNMT-py (Klein et al., 2017) and used the "base" model with default parameters for the multi-modal task baseline. We generated a vocabulary of 32k sub-word types jointly for both the source and target languages. The vocabulary is shared between the encoder and decoder.

[...] RIBES (Isozaki et al., 2010) and AMFM (Banchs et al., 2015). BLEU scores were calculated using multi-bleu.perl in the Moses toolkit (Koehn et al., 2007). RIBES scores were calculated using RIBES.py version 1.02.4. AMFM scores were calculated using scripts created by the technical collaborators listed in the WAT2019 web page. All scores for each task were calculated using the corresponding reference translations.
Before the calculation of the automatic evaluation scores, the translation results were tokenized or segmented with tokenization/segmentation tools for each language. For Japanese segmentation, we used three different tools: Juman version 7.0 (Kurohashi et al., 1994), KyTea 0.4.6 (Neubig et al., 2011) with full SVM model and MeCab 0.996 (Kudo, 2005).

Automatic Evaluation System
The automatic evaluation system receives translation results from participants and automatically gives evaluation scores to the uploaded results. As shown in Figure 2, the system requires participants to provide the following information for each submission:
• Human Evaluation: whether or not they submit the results for human evaluation;
• Publish the results of the evaluation: whether or not they permit the automatic evaluation scores to be published on the WAT2019 web page;
• Task: the task they submit the results for;
• Used Other Resources: whether or not they used additional resources; and
• Method: the type of the method including SMT, RBMT, SMT and RBMT, EBMT, NMT and Other.
Evaluation scores of translation results that participants permit to be published are disclosed via the WAT2019 evaluation web page. Participants can also submit the results for human evaluation using the same web interface. This automatic evaluation system will remain available even after WAT2019. Anybody can register an account for the system by the procedures described in the registration web page.

Additional Automatic Scores in Multi-Modal Task
[Figure 2: text of the submission page of the automatic evaluation system]
The latest versions of Chrome, Firefox, Internet Explorer and Safari are supported for this site.
Before you submit files, you need to enable JavaScript in your browser.
File format: Submitted files should NOT be tokenized/segmented. Please check the automatic evaluation procedures.
Submitted files should be encoded in UTF-8 format.
Translated sentences in submitted files should have one sentence per line, corresponding to each test sentence. The number of lines in the submitted file and that of the corresponding test file should be the same. If you want to submit the file for human evaluation, check the box "Human Evaluation". Once you upload a file with checking "Human Evaluation" you cannot change the file used for human evaluation.
When you submit the translation results for human evaluation, please check the checkbox of "Publish" too.
You can submit two files for human evaluation per task.
One of the files for human evaluation is recommended not to use other resources, but it is not compulsory. Other: Team Name, Task, Used Other Resources, Method, System Description (public), Date and Time (JST), BLEU, RIBES and AMFM will be disclosed on the Evaluation Site when you upload a file checking "Publish the results of the evaluation".
You can modify some fields of submitted data. Read "Guidelines for submitted data" at the bottom of this page.

For the multi-modal task, several additional automatic metrics were run aside from the WAT evaluation server, namely: BLEU (now calculated by the Moses scorer), characTER (Wang et al., 2016), chrF3 (Popović, 2015), TER (Snover et al., 2006), WER, PER and CDER (Leusch et al., 2006). For metrics where lower scores are better, we reverse the score by taking 1 − x and indicate this by prepending "n" to the metric name. With this modification, higher scores always indicate a better translation result. Also, we multiply all metric scores by 100 for better readability.
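The score reversal and scaling described for the additional multi-modal metrics can be sketched in a few lines of Python. The function name and the `lower_is_better` flag are hypothetical and serve only to make the convention concrete; which metrics count as error metrics has to be decided by the caller.

```python
from typing import Tuple


def report_score(name: str, value: float, lower_is_better: bool) -> Tuple[str, float]:
    """Return the reported metric name and score.

    Error metrics (lower is better) are reversed as 1 - x and renamed with
    an "n" prefix; every score is multiplied by 100 for readability.
    """
    if lower_is_better:
        return "n" + name, 100.0 * (1.0 - value)
    return name, 100.0 * value
```

For instance, a TER of 0.25 would be reported as nTER = 75, so that it can be read in the same direction as BLEU.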

Human Evaluation
In WAT2019, we conducted three kinds of human evaluations: pairwise evaluation (Section 5.1) and JPO adequacy evaluation (Section 5.2) for text-only language pairs and a pairwise variation of direct assessment (Section 5.3) for the multi-modal task.

Pairwise Evaluation
We conducted pairwise evaluation for participants' systems submitted for human evaluation. The submitted translations were evaluated by a professional translation company and Pairwise scores were given to the submissions by comparing with baseline translations (described in Section 3).

Sentence Selection and Evaluation
For the pairwise evaluation, we randomly selected 400 sentences from the test set of each task. We used the same sentences as last year for the continuing subtasks. Baseline and submitted translations were shown to annotators in random order with the input source sentence. The annotators were asked to judge which of the translations is better, or whether they are on par.

Voting
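To guarantee the quality of the evaluations, each sentence is evaluated by 5 different annotators and the final decision is made depending on the 5 judgements. We define each judgement j_i (i = 1, ..., 5) as:
\[
j_i = \begin{cases} 1 & \text{if better than the baseline} \\ -1 & \text{if worse than the baseline} \\ 0 & \text{if the quality is the same} \end{cases}
\]
The final decision D is defined as follows using S = \sum_i j_i:
\[
D = \begin{cases} \text{win} & (S \geq 2) \\ \text{loss} & (S \leq -2) \\ \text{tie} & (\text{otherwise}) \end{cases}
\]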
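The voting step can be sketched as follows. The sketch sums the five judgements and maps the sum S to a decision; the threshold of ±2 on S is an assumption of this sketch (exposed as a parameter), and the function name is hypothetical.

```python
from typing import Sequence


def final_decision(judgements: Sequence[int], threshold: int = 2) -> str:
    """Aggregate per-annotator judgements (+1, -1, 0) into win/loss/tie.

    `threshold` is the assumed minimum absolute value of the sum S that is
    needed for a win or a loss; anything in between counts as a tie.
    """
    if any(j not in (-1, 0, 1) for j in judgements):
        raise ValueError("each judgement must be -1, 0 or 1")
    s = sum(judgements)
    if s >= threshold:
        return "win"
    if s <= -threshold:
        return "loss"
    return "tie"
```

With five annotators, a single dissenting vote is therefore not enough to turn a tie into a win or a loss.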

Pairwise Score Calculation
Suppose that W is the number of wins compared to the baseline, L is the number of losses and T is the number of ties. The Pairwise score can be calculated by the following formula:
\[
\text{Pairwise} = 100 \times \frac{W - L}{W + L + T}
\]
From the definition, the Pairwise score ranges between -100 and 100.
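As a small worked example, the Pairwise score can be computed as below. The normalization 100 × (W − L) / (W + L + T) is assumed here because it is the form consistent with the stated range of -100 to 100; the function name is hypothetical.

```python
def pairwise_score(wins: int, losses: int, ties: int) -> float:
    """Pairwise score in [-100, 100] from win/loss/tie counts."""
    total = wins + losses + ties
    if total == 0:
        raise ValueError("at least one evaluated sentence is required")
    return 100.0 * (wins - losses) / total
```

A system that wins on 150 of 400 sentences, loses on 100 and ties on 150 gets 100 × 50 / 400 = 12.5.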

Confidence Interval Estimation
There are several ways to estimate a confidence interval. We chose to use bootstrap resampling (Koehn, 2004) to estimate the 95% confidence interval.
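The resampling procedure is not spelled out at this point, so the following Python sketch shows only one plausible reading. It assumes that 300 of the 400 evaluated sentences are drawn per iteration, that 1000 iterations are run, and that the interval is read off the sorted scores after discarding both tails; all three choices, the sampling without replacement, and the function names are assumptions of this sketch.

```python
import random
from typing import Sequence, Tuple


def pairwise_score(decisions: Sequence[int]) -> float:
    """Pairwise score of per-sentence decisions (+1 win, -1 loss, 0 tie)."""
    return 100.0 * sum(decisions) / len(decisions)


def bootstrap_interval(
    decisions: Sequence[int],
    sample_size: int = 300,
    iterations: int = 1000,
    level: float = 0.95,
    seed: int = 0,
) -> Tuple[float, float]:
    """Estimate an interval for the Pairwise score by resampling sentences."""
    rng = random.Random(seed)
    pool = list(decisions)
    # Assumption: each iteration draws a subset without replacement;
    # rng.choices(pool, k=sample_size) would draw with replacement instead.
    scores = sorted(
        pairwise_score(rng.sample(pool, sample_size)) for _ in range(iterations)
    )
    cut = int(iterations * (1.0 - level) / 2.0)  # scores dropped per tail
    return scores[cut], scores[iterations - 1 - cut]
```

With the default settings, the 25 lowest and the 25 highest of the 1000 resampled scores are discarded.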

JPO Adequacy Evaluation
We conducted JPO adequacy evaluation for the top two or three participants' systems of pairwise evaluation for each subtask (the number of systems varies depending on the subtask). The evaluation was carried out by translation experts based on the JPO adequacy evaluation criterion, which is originally defined by JPO to assess the quality of translated patent documents.

Sentence Selection and Evaluation
For the JPO adequacy evaluation, 200 test sentences were randomly selected from the 400 test sentences used for the pairwise evaluation. For each test sentence, the input source sentence, the translation by the participant's system, and the reference translation were shown to the annotators.
To guarantee the quality of the evaluation, each sentence was evaluated by two annotators. Note that the selected sentences are the same as those used in the previous workshops except for the new subtasks at WAT2019. Table 14 shows the JPO adequacy criterion from 5 to 1. The evaluation is performed subjectively. "Important information" represents the technical factors and their relationships. The degree of importance of each element is also considered in the evaluation. The percentages in each grade are rough indications for the transmission degree of the source sentence meanings. The detailed criterion is described in the JPO document (in Japanese).

Manual Evaluation for the Multi-Modal Task
The evaluations of the three tracks of the multi-modal task follow the Direct Assessment (DA) technique by asking annotators to assign a score from 0 to 100 to each candidate. The score is assigned using a slider with no numeric feedback; the scale is therefore effectively continuous. After a certain number of scored items, each of the annotators stabilizes in their predictions. The collected DA scores can be either directly averaged for each system and track (denoted "Ave"), or first standardized per annotator and then averaged ("Ave Z"). The standardization removes the effect of individual differences in the range of scores assigned: the scores are scaled so that the average score of each annotator is 0 and the standard deviation is 1.
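The two averages can be made concrete with a short Python sketch. It computes "Ave" as the plain mean per system and "Ave Z" after standardizing each annotator's scores to mean 0 and standard deviation 1. The input layout, the function name, and the use of the population standard deviation are assumptions of this sketch.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, List, Tuple


def ave_and_ave_z(
    judgements: List[Tuple[str, str, float]]
) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Compute raw and per-annotator standardized averages per system.

    `judgements` holds (annotator, system, score) triples with scores 0-100.
    """
    by_annotator = defaultdict(list)
    for annotator, _, score in judgements:
        by_annotator[annotator].append(score)
    # Assumption: population standard deviation; annotators whose scores
    # never vary contribute a standardized score of 0.
    stats = {a: (mean(s), pstdev(s)) for a, s in by_annotator.items()}

    raw, standardized = defaultdict(list), defaultdict(list)
    for annotator, system, score in judgements:
        mu, sigma = stats[annotator]
        raw[system].append(score)
        standardized[system].append((score - mu) / sigma if sigma > 0 else 0.0)

    ave = {system: mean(values) for system, values in raw.items()}
    ave_z = {system: mean(values) for system, values in standardized.items()}
    return ave, ave_z
```

In this layout, a strict annotator and a lenient annotator who rank two systems the same way contribute identical standardized scores, even though their raw scores differ.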
Our evaluation differs from the basic DA in the following respects: (1) we run the evaluation bilingually, i.e. we require the annotators to understand the source English sufficiently to be able to assess the adequacy of the Hindi translation, (2) we ask the annotators to score two distinct segments at once, while the original DA displays only one candidate at a time.
The main benefit of bilingual evaluation is that the reference is not needed for the evaluation. Instead, the reference can be included among other candidates and the manual evaluation allows us to directly compare the performance of MT to human translators.
The dual judgment (scoring two candidates at once) was added experimentally. The advantage is saving some of the annotators' time (they do not need to read the source or examine the picture again) and the chance to evaluate candidates also in terms of direct pairwise comparisons. In the history of WMT (Bojar et al., 2016), 5-way relative ranking was used for many years. With 5 candidates, the individual pairs may not be compared very precisely. With the single-candidate DA, pairwise comparisons cannot be used as the basis for system ranking. We believe that two candidates on one screen could be a good balance.
For the full statistical soundness, the judgments should be independent of each other. This is not the case in our dual scoring, even if we explicitly ask people to score the candidates independent of each other. The full independence is however not assured even in the original approach because annotators will remember their past judgments. This year, WMT even ran DA with document context available to the annotators by scoring all segments from a given document one after another in their natural order. We thus dare to pretend independence of judgments when interpreting DA scores. The user interface for our annotation for each of the tracks is illustrated in Figure 3, Figure 4, and Figure 5.
In the "text-only" evaluation, one English text (source) and two Hindi translations (candidate 1 and 2) are shown to the annotators. In the "multi-modal" evaluation, the annotators are shown both the image and the source English text. The first question is to validate if the source English text is a good caption for the indicated area. For two translation candidates, the annotators are asked to independently indicate to what extent the meaning is preserved. The "Hindi captioning" evaluation shows only the image and two Hindi candidates. The annotators are reminded that the two captions should be treated independently and that each of them can consider a very different aspect of the region.

Evaluation Results
In this section, the evaluation results for WAT2019 are reported from several perspectives. Some of the results for both automatic and human evaluations are also accessible at the WAT2019 website.

Official Evaluation Results
Figures 6, 7, 8 and 9 show the official evaluation results of ASPEC subtasks, Figures 10, 11, 12, 13, 14 and 15 show those of JPC subtasks, Figures 16 and 17 show those of JIJI subtasks, Figures 18 and 19 [...] Tables 17 and 18. The weights for the weighted κ (Cohen, 1968) are defined as |Evaluation1 − Evaluation2|/4.
The automatic scores for the multi-modal task along with the WAT evaluation server BLEU scores are provided in Table 20. For each of the test sets (E-Test and C-Test), the scores are comparable across all the tracks (text-only, captioning or multi-modal translation) because the underlying set of reference translations is the same. The scores for the captioning task will, however, be very low because captions generated independently of the English source caption are very likely to differ from the reference translation.
For the multi-modal task, Table 19 shows the manual evaluation scores for all valid system submissions. As mentioned above, we used the reference translation as if it was one of the competing systems, see the rows "Reference" in the table. The annotation was fully anonymized, so the annotators had no chance of knowing if they were scoring human translation or MT output.
Table 21 shows the results of statistical significance testing of the aspec-ja-en subtasks, Table 22 shows those of the JIJI subtasks, and Table 23 shows those of the TDDC subtasks. ⋙, ≫ and > mean that the system in the row is better than the system in the column at a significance level of p < 0.01, 0.05 and 0.1 respectively. Testing is also done by bootstrap resampling as follows:
1. randomly select 300 sentences from the 400 pairwise evaluation sentences, and calculate the Pairwise scores on the selected sentences for both systems;
2. iterate the previous step 1000 times and count the number of wins (W), losses (L) and ties (T);
3. calculate p = L / (W + L).
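The three steps of the significance test can be sketched in Python as follows. The sketch assumes that each system's human evaluation is available as one decision per sentence (+1 win, -1 loss, 0 tie against the baseline), aligned by sentence, and that the 300 sentences are drawn without replacement; the function names and the NaN return value for the all-ties case are assumptions of this sketch.

```python
import random
from typing import Sequence


def pairwise_score(decisions: Sequence[int]) -> float:
    """Pairwise score of per-sentence decisions (+1 win, -1 loss, 0 tie)."""
    return 100.0 * sum(decisions) / len(decisions)


def bootstrap_p_value(
    decisions_a: Sequence[int],
    decisions_b: Sequence[int],
    sample_size: int = 300,
    iterations: int = 1000,
    seed: int = 0,
) -> float:
    """Estimate p = L / (W + L) for the claim that system A beats system B."""
    if len(decisions_a) != len(decisions_b):
        raise ValueError("both systems must be judged on the same sentences")
    rng = random.Random(seed)
    indices = range(len(decisions_a))
    wins = losses = ties = 0
    for _ in range(iterations):
        # Step 1: score both systems on the same random subset of sentences.
        chosen = rng.sample(indices, sample_size)
        score_a = pairwise_score([decisions_a[i] for i in chosen])
        score_b = pairwise_score([decisions_b[i] for i in chosen])
        # Step 2: count wins, losses and ties of A against B.
        if score_a > score_b:
            wins += 1
        elif score_a < score_b:
            losses += 1
        else:
            ties += 1
    # Step 3: p is undefined when every iteration ends in a tie.
    if wins + losses == 0:
        return float("nan")
    return losses / (wins + losses)
```

A small p then supports the claim that the system in the row is better than the system in the column.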

Inter-annotator Agreement
To assess the reliability of agreement between the workers, we calculated the Fleiss' κ (Fleiss et al., 1971) values. The results are shown in Table 24. We can see that the κ values are larger for X→J translations than for J→X translations. This may be because the majority of the workers for these language pairs are Japanese, and evaluating one's mother tongue is generally much easier than evaluating other languages. The κ values for the Hindi language pairs are relatively high. This might be because the overall translation quality for Hindi is low, and the evaluators can easily distinguish better translations from worse ones.
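For reference, Fleiss' κ can be computed from a table of category counts per item, as in the minimal Python sketch below. In the pairwise setting each row would hold, for one sentence, how many of the annotators chose "better", "worse" or "same"; this input layout and the function name are assumptions of the sketch, and the code is the standard textbook formula rather than the organizers' script.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table of per-item category counts.

    counts[i][j] is the number of raters who assigned item i to category j;
    every item must be rated by the same number of raters (at least two).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")
    n_categories = len(counts[0])

    # Proportion of all assignments that fall into each category.
    p_category = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement among rater pairs, per item.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_category)
    return (p_observed - p_expected) / (1.0 - p_expected)
```

A value of 1 means perfect agreement, while values near 0 mean agreement no better than chance.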

Findings
In this section, we will show findings of some of the translation tasks.

TDDC
In the results of both the automatic evaluation and the human evaluation, every system translated most sentences correctly. According to the human evaluation of the 'Items' and 'Texts' subtasks, all evaluators rated more than 70% of all the pairs at 4 or 5. Most of these high-rated pairs consist of typical terms and sentences from timely disclosure documents. This task focuses on the accurate translation of figures, so the evaluation criteria confirmed that there are no mistranslations in the typical sentences containing figures, such as units of money and dates. However, uncommon sentences used in timely disclosure documents tend to be mistranslated. For example, uncommon proper nouns tended to be omitted or mistranslated into words with other meanings. In addition, sentences with complex and uncommon structures, generally long sentences, caused errors in the dependency of subordinate clauses.
In addition, some systems translated sentences without subjects into sentences with incorrect subjects. Japanese sentences often omit subjects and objects, which would normally be included in English. For example, the Japanese sentence "当社普通株式27,000株を上限とする。" (Common shares of the Company, limited to a maximum of 27,000 shares) was translated to "(Unrelated company name) common stocks up to 27,000 shares".
Moreover, there are some incorrect modifiers or determiners. In Japanese timely disclosure documents, there are many variable prefixes for dates, such as "本" (this), "当" (this), "次" (next), and "前" (last). Some systems translated sentences containing these words with an incorrect year. For example, a Japanese sentence containing "当第3四半期連結会計期間末" (the end of the third quarter of this fiscal year) was translated to "the end of the third quarter of FY 2016".
In summary, the causes of these mistranslations are considered as follows: • It is difficult for the systems to translate long sentence and proper nouns which TDDC does not contain.
• Some source sentences are unclear due to lack of subjects and/or objects, so these are not suitable for English translation.
• TDDC contains not semantically balanced pairs and the systems might be affected strongly by either of source pair sentences.
On the other hand, some translations seem to be fitted to TDDC sentences that were translated freely and omit redundant expressions, but evaluators gave them low scores, probably because they are not literal translations. This result implies that another evaluation criterion is needed, one that evaluates whether information is correctly conveyed to investors.
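As an illustration of the figure-related concerns discussed in this section, the following minimal sketch compares the numeric figures found in a source sentence with those in its translation. It is not part of the WAT2019 evaluation procedure; the function names `extract_figures` and `figure_mismatch` are hypothetical, and the check only looks at digit strings, so spelled-out numbers such as "third" are reported as missing.

```python
import re
import unicodedata
from collections import Counter

# Integers or decimals written with digits (assumption: no other number formats).
_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def extract_figures(text: str) -> Counter:
    """Return a multiset of the digit-based figures in `text`."""
    # NFKC folds full-width digits and commas into their ASCII forms.
    normalized = unicodedata.normalize("NFKC", text)
    # Drop thousands separators so "27,000" and "27000" compare equal.
    normalized = re.sub(r"(?<=\d),(?=\d{3})", "", normalized)
    return Counter(_NUMBER.findall(normalized))


def figure_mismatch(source: str, translation: str):
    """Return (missing, extra): figures lost from or added to the translation."""
    src = extract_figures(source)
    tgt = extract_figures(translation)
    return src - tgt, tgt - src


# Example from the text: the figure 27,000 is preserved.
missing, extra = figure_mismatch(
    "当社普通株式 27,000 株 を上限とする。",
    "Common shares of the Company, limited to a maximum of 27,000 shares",
)
print(dict(missing), dict(extra))
```

Such a check can flag a year that appears only in the translation (as in the "FY 2016" example), but it cannot judge whether subjects or proper nouns are correct, so it would complement rather than replace human evaluation.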

English↔Tamil Task
We observed that most participants used transfer learning techniques such as fine-tuning and mixed fine-tuning for Tamil→English translation, leading to reasonably high-quality translations. However, English→Tamil translation quality is still poor, mainly because of the lack of helping parallel corpora. We expect that using large in-domain monolingual corpora for back-translation will help alleviate this problem. We will provide such corpora for next year's task.
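The back-translation idea mentioned above can be sketched as follows, under stated assumptions: monolingual Tamil sentences are translated into English by a reverse (Tamil→English) system, and each synthetic English sentence is paired with its authentic Tamil sentence as additional English→Tamil training data. The function `back_translate` and the `reverse_translate` callable are hypothetical placeholders, not an API of any particular toolkit or of a participant's system.

```python
from typing import Callable, Iterable, List, Tuple


def back_translate(
    target_monolingual: Iterable[str],
    reverse_translate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Build synthetic (source, target) pairs from target-side monolingual text.

    `reverse_translate` stands in for a trained target-to-source model
    (here, Tamil to English); it is an assumption of this sketch.
    """
    pairs = []
    for tgt in target_monolingual:
        tgt = tgt.strip()
        if not tgt:
            continue  # skip empty lines
        synthetic_src = reverse_translate(tgt).strip()
        if synthetic_src:
            # Synthetic source side, authentic target side.
            pairs.append((synthetic_src, tgt))
    return pairs


# Toy stand-in for a reverse model, used only to show the data flow.
toy_reverse = lambda sentence: "en:" + sentence
synthetic = back_translate(["ta sentence one", "  ", "ta sentence two"], toy_reverse)
print(synthetic)
```

The synthetic pairs would then be mixed with the authentic parallel data for training; the mixing ratio and any filtering are design choices left open here.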

News Commentary Task
We received only 3 submissions for Russian↔Japanese translation. All of them leveraged the multilingualism and multi-step fine-tuning proposed by Imankulova et al. (2019) and showed that carefully choosing corpora and robust training can dramatically enhance the quality of NMT for language pairs that have very small in-domain parallel corpora. For next year's task, we expect more submissions in which participants leverage additional, larger helping monolingual and bilingual corpora.

Validation of Source English Captions
In the manual evaluation of the multimodal track, our annotators saw both the picture and the source text (and the two scored candidates). We took this opportunity to double-check the quality of the original HVG data. Prior to scoring the candidates, we asked our annotators to confirm that the source English text is a good caption for the indicated region of the image.
The results in Table 25 indicate that for a surprisingly high number of items we did not receive any answer. This confirms that even non-anonymous annotators can easily provide sloppy evaluations. It is possible that part of these omissions can be attributed to our annotation interface, which showed all items on one page and relied on scrolling. Next time, we will show only one annotation item on each page and also consider highlighting unanswered questions. Strictly requiring an answer would not always be appropriate, but we need to ensure that annotators are aware that they are skipping a question.
Luckily, the bad source captions are not a frequent case, amounting to 1 or 2% of assessed examples.

Relation to Human Translation
The bilingual style of evaluation of the multimodal task allowed us to evaluate the reference translations as if they were yet another competing MT system. Table 19 thus also lists the "Reference".
Across the tracks and test sets (EV vs. CH), humans surpass MT candidates. The single exception is IDIAP run 2956, which wins in text-only translation on the E-Test, but this is not confirmed on the C-Test (CH). The score of the anonymized system 683 on the E-Test in the multi-modal track (MM) has also almost reached human performance. These are not the first cases of MT performing on par with humans, and we are happy to see this when targeting an Indian language.

Evaluating Captioning
While the automatic scores are comparable across tasks, the Hindi-only captioning ("HI") must be considered separately. Without a source sentence, both humans and machines are very likely to come up with highly varying textual captions. The same image can be described in many different aspects. All our automatic metrics compare the candidate caption with the reference one generally on the basis of the presence of the same character sequences, words or n-grams. Candidates diverging from the reference will get a low score regardless of their actual quality.
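To make the last point concrete, the following minimal sketch computes a BLEU-style clipped n-gram precision between a candidate and a single reference. It is a simplified stand-in for the automatic metrics, not the evaluation code used in WAT2019; the function names are hypothetical, and the English captions are invented placeholders for Hindi captions.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate: str, reference: str, n: int) -> float:
    """Fraction of candidate n-grams also found in the reference (counts clipped)."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum((cand & ref).values())  # per-n-gram minimum of the two counts
    return matched / total


reference = "a man riding a horse"
close = "a man riding a brown horse"          # near-copy of the reference
divergent = "person on horseback in a field"  # plausible caption, different wording

for caption in (close, divergent):
    print(caption, clipped_precision(caption, reference, 1),
          clipped_precision(caption, reference, 2))
```

In this toy case the divergent caption shares no bigram with the reference and therefore scores zero at that order, although it could describe the same image region equally well.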
The automatic evaluation scores for the "Hindi caption" sub-task are very low compared to the other sub-tasks ("text-only" and "multimodal" translation), as can be seen in Table 20. Even the human annotators could not give any score to most of the segments submitted for the "Hindi caption" entries because the generated captions were wrong.

Conclusion and Future Perspective
This paper summarizes the shared tasks of WAT2019. We had 25 participants worldwide and collected a large number of submissions, which are useful for improving current machine translation systems through analysis of the submissions and identification of their issues.
For the next WAT workshop, we plan to conduct document-level evaluation using the new dataset with context for some translation subtasks and we would like to consider how to realize context-aware machine translation in WAT. Also, we are planning to do extrinsic evaluation of the translations.

Acknowledgement
The multi-modal shared task was supported by the following grants at Idiap Research Institute and Charles University.
• At Idiap Research Institute, the work was supported by an innovation project (under an InnoSuisse grant) oriented to improve the automatic speech recognition and natural language understanding technologies for German (Title: "SM2: Extracting Semantic Meaning from Spoken Material", funding application no. 29814.1 IP-ICT). The work was also supported by the EU H2020 project "Real-time network, text, and speaker analytics for combating organized crime" (ROXANNE), grant agreement: 833635.
• At Charles University, the work was supported by the grants 19-26934X (NEUREM3) of the Czech Science Foundation and "Progress" Q18+Q48 of Charles University, and used language resources distributed by the LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic.