UdS-DFKI Participation at WMT 2019: Low-Resource (en-gu) and Coreference-Aware (en-de) Systems

This paper describes the UdS-DFKI submission to the WMT2019 news translation task for Gujarati–English (low-resourced pair) and German–English (document-level evaluation). Our systems rely on the on-line extraction of parallel sentences from comparable corpora for the first scenario and on the inclusion of coreference-related information in the training data in the second one.


Introduction
This document describes the systems and experiments conducted to participate in the news translation tasks of WMT 2019 for Gujarati-English (gu-en, low-resourced language pair) and German-English (de-en, document-level evaluation). We use different approaches to tackle each setting.
Machine translation (neural, statistical or rule-based) usually operates on a sentence-by-sentence basis. However, when translating a coherent document, surrounding sentences may contain information that needs to be reflected in a local sentence. In our experiments for the document-level task in en2de, we explore how information beyond the sentence level can be made available to a neural machine translation (NMT) system by modifying (tagging) the data in order to include this knowledge. In a similar way, multilingual NMT systems have already been successfully built by only tagging the source data with the knowledge of the target language (Johnson et al., 2017; Ha et al., 2016). With this approach, we incorporate the knowledge carried by coreference chains through a text into every sentence. We expect to improve the translation of ambiguous items such as pronouns in English, so we tackle only a specific set of problems and not translation quality in general.
The approach for the low-resource setting is completely different. In this case, we use a neural architecture that allows us to extract parallel data from comparable corpora and filter noise from the available parallel data. The additional data obtained in this way is then used to train SMT models, which we compare to a baseline trained on the available parallel data only to observe the effects of the extraction and filtering.
Below, we describe our coreference-aware system for en2de (Section 2) and our low-resourced approach for en-gu (Section 3). Finally we summarise our findings in Section 4.

Coreference-Aware English-to-German System

Data Preparation
Our system makes use of the annotation of coreference mentions through documents in the source side of the corpus. Documents are annotated with coreference chains using a neural-network-based mention-ranking model as implemented by the Stanford CoreNLP tool (Manning et al., 2014). The tool detects pronominal, nominal and proper names as mentions in a chain. For every mention, CoreNLP extracts its gender (male, female, neutral, unknown), number (singular, plural, unknown), and animacy (animate, inanimate, unknown). This information is not added directly but used to enrich the MT training data by applying a set of heuristics implemented in DocTrans:
• We enrich pronominal mentions with the head of the chain
  - Pronoun "I" is not enriched with any coreference information
  - We clean the head by removing articles and Saxon genitives and we only consider heads with fewer than 4 tokens in order to avoid enriching a word with a full sentence
• We enrich nominal mentions including proper names with the gender of the head
• The head itself is enriched with she/he/it/they depending on its gender and animacy
The example below shows how we tag the cleaned version of the head of the chain (fish skin) before a pronominal mention (it):
baseline: I never cook with it.
coref: I never cook with <b crf> fish skin <e crf> it.
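The pronoun heuristic can be illustrated with a minimal sketch. It is not the DocTrans implementation: the function names, the article list and the input format (a mapping from the token index of a pronoun to the head of its chain) are assumptions made for illustration, and the tag strings are copied as they appear in the example above.

```python
import re

# Tag strings as printed in the example above; the exact spelling used
# in the real training data is an assumption.
BEGIN_TAG = "<b crf>"
END_TAG = "<e crf>"
ARTICLES = {"a", "an", "the"}
MAX_HEAD_TOKENS = 3  # "fewer than 4 tokens"


def clean_head(head):
    """Remove articles and Saxon genitives from the head of a chain."""
    cleaned = []
    for token in head.split():
        if token.lower() in ARTICLES or token in ("'s", "'"):
            continue
        token = re.sub(r"'s$|'$", "", token)  # John's -> John
        if token:
            cleaned.append(token)
    return " ".join(cleaned)


def tag_pronouns(tokens, pronoun_heads):
    """Insert the cleaned chain head before each pronominal mention.

    tokens: list of source tokens.
    pronoun_heads: hypothetical dict {token index of pronoun: head string}.
    """
    output = []
    for index, token in enumerate(tokens):
        head = pronoun_heads.get(index)
        if head is not None and token != "I":  # "I" is never enriched
            cleaned = clean_head(head)
            if cleaned and len(cleaned.split()) <= MAX_HEAD_TOKENS:
                output.extend([BEGIN_TAG, cleaned, END_TAG])
        output.append(token)
    return " ".join(output)
```

Called with the tokens of "I never cook with it ." and the head "the fish skin" attached to the pronoun "it", the sketch produces the coref line of the example, up to tokenisation of the final period.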
In order to be able to do this processing, we need documents and that limits the amount of corpora we can use. Even though all the corpora made available for the shared task have document boundaries, ParaCrawl, for instance, has a mean of 1.06 sentences per document which makes it useless within our approach.

Corpus
Monolingual corpora. We use a subset of the NewsCrawl corpus in English and German (years 2014, 2017 and a part of 2018, named ss-NewsCrawl in Table 1) to calculate word embeddings as explained in Section 2.3. We first use langdetect to extract only those sentences that are in the desired language and compile the final corpora to have a similar number of subword units (Sennrich et al., 2016a) in both languages and years (∼4·10^9). The corpus is further cleaned, tokenised, truecased (with Moses scripts) and BPEd (with subword-nmt). The vocabulary of the BPE model depends on the system and is detailed in Section 2.3.
Parallel corpora. Due to the restrictions explained in Section 2.1, we use the parallel corpora made available for the shared task in different proportions. Europarl, News Commentary and Rapid Corpus.
Our large system also uses the ParaCrawl corpus but in a diluted way. The purpose of the dilution is to mitigate the fact that, due to the nature of our system, we cannot use single sentences (intra-sentence dependencies are already learned by an NMT system) or back-translations (the quality is not good enough to extract coreference chains from a source sentence that is an automatic translation). CommonCrawl, Europarl and News Commentary are cleaned, tokenised, truecased and BPEd with the same tools as the monolingual corpus. For the Rapid corpus, we performed an additional cleaning: since some German sentences were missing umlauts, we removed all the sentences that contained any word clearly missing an umlaut such as europishen or erklrte. For ParaCrawl, we first removed sentence pairs that were not detected as English and German sentences by langdetect and afterwards we removed sentences with emoji, bullets, and specific tokens such as http, pdf, e, or hotel. With this, we remove more than half of the sentences in the corpus. The final numbers of sentences for all the corpora used for training are provided in Table 1. Notice that we oversample the News Commentary corpus as it is supposed to have a similar domain to the test set.
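The ParaCrawl cleaning can be pictured with the rough sketch below. It is only an approximation under stated assumptions: the blocked-token list contains just the tokens named in the text, matching is done on substrings, the emoji check covers two common Unicode ranges, and the language detector is passed in as a function so that the sketch does not depend on a particular library.

```python
# Tokens named in the text; the full list used for the submission is not given.
BLOCKED_TOKENS = ("http", "pdf", "hotel")
BULLETS = ("•", "◦", "▪", "‣")  # assumed set of bullet characters


def has_emoji(text):
    """Rough emoji check over two common Unicode ranges (an assumption)."""
    return any(
        0x1F300 <= ord(ch) <= 0x1FAFF or 0x2600 <= ord(ch) <= 0x27BF
        for ch in text
    )


def keep_pair(en_sentence, de_sentence, detect):
    """Return True if an English-German sentence pair survives the cleaning.

    detect: any function mapping a sentence to a language code
    such as "en" or "de".
    """
    # Step 1: both sides must be detected as the expected language.
    if detect(en_sentence) != "en" or detect(de_sentence) != "de":
        return False
    # Step 2: drop pairs containing emoji, bullets or blocked tokens.
    for sentence in (en_sentence, de_sentence):
        if has_emoji(sentence) or any(b in sentence for b in BULLETS):
            return False
        lowered = sentence.lower()
        if any(token in lowered for token in BLOCKED_TOKENS):
            return False
    return True
```

With the langdetect package installed, its `detect` function, which returns codes such as `en` or `de`, could be passed as the `detect` argument.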
Transformer big. As Transformer base but with 1024-dim word embeddings, 4096-dim hidden feed-forward layers, and a learning rate of 0.0002 with the same warmup and decay; β2=0.998.
Using these architectures as a basis, we train several models on 4 TITAN X GPUs using an adaptive batch size. The models differ in:
• Corpus size. Small vs. Large as defined in Table 1.
• Vocabulary. Joint en-de BPE with 40K subword units (join) vs. separated vocabularies with 50K subword units each (all the other models).
• Annotation. No annotation (Baseline) vs. tags with coreference information (all the other models).
• Ensembling. Combinations of the previous models at decoding time.
The terms in parentheses refer to the models in Table 2. Model names are structured as architectureVocabulary-Annotation -Embeddings-Corpus.

Results
Table 2 shows the BLEU scores of the different models and ensembles on newstest2017 (validation) and newstest2018 (test). The first block presents the results of a baseline system without any document-level information; the second block shows the models explored to determine the best configuration; and the third block summarises the ensembling combinations explored in order to choose our primary submission. The first thing to notice is that, in terms of BLEU, systems with and without coreference annotations are not significantly different (M01 vs. M08; M02 vs. M09/M11). Since we are modifying only specific aspects of the translation (few words in a document), we do not obtain large improvements according to automatic evaluation measures, but we expect differences in translation quality according to human evaluators.
The vocabulary turned out to be critical. A system with a joint vocabulary of 40K subword units (M03) is 5-6 BLEU points below its counterpart with independent vocabularies of 50K units each (M04).
Embeddings are not that decisive. Initialising the system with bilingual embeddings slightly improves the results (M07 vs. M05; M10 vs. M09; M12 vs. M11). Using monolingual embeddings makes training very slow. M06 in Table 2 is 10 BLEU points below its counterpart with bilingual embeddings (M07), but the training was far from converging even when running for more days.
As expected, increasing the size of the corpus and the number of parameters of the architecture is beneficial for the final translation quality. The former has the only disadvantage of needing more time and computing power. The latter, even if achieving around 2 BLEU points of improvement (M04 vs. M05; M08 vs. M09), does not allow us to use document-level information during training for part of the data.
An ensemble of different high-performing models showed better results than the combination of the last checkpoints of the best model. Different combinations are reported in Table 2, all of them using a beam search of size 10, which also performed better than the default value of 6. The best ensemble comes from the combination of the four best performing individual models, but unfortunately the two best performing models were not ready at submission time. M11 and M12 are the same as M09 and M10 before convergence and were the ones used in the ensemble for our primary submission.

Corpus
Monolingual corpora. The monolingual corpora were used mainly as additional data for training word embeddings in en and gu. For English we use the same NewsCrawl selection as for en-de (ss-NewsCrawl). For Gujarati we use the 2018 version of NewsCrawl and CommonCrawl.
To further increase the available data size for training Gujarati embeddings as well as to add similar content to the English word embeddings, we crawled additional Gujarati news pages and, where available, their English counterparts. This yielded an increase of about 2 M monolingual Gujarati sentences. Articles written during the period from which the test corpus newstest2019 was created were excluded from the crawl. The number of sentences and tokens extracted from each news outlet is shown in Table 3.
Wikipedia (WP) is a popular source for comparable documents. In order to later extract parallel sentences from it, the WP dumps for English and Gujarati are downloaded. Only the subset of articles that are linked across both languages using Wikipedia's langlinks are extracted. That is, an article is only taken into account if there is a linked article in the other language. For these purposes, we use WikiTailor (Barrón-Cedeño et al., 2015) to obtain the intersection of articles of both languages. We additionally use the en-gu WP reference which was made available for WMT 2019. The monolingual WP in Gujarati is added to the monolingual data for training the embeddings.
Parallel corpora. We use the concatenation of several parallel corpora available for the en-gu news translation task to train the base model. Firstly, the Bible corpus as well as two corpora specially made for WMT 2019 are used, namely a crawled corpus (WMT19 Crawl) and a localisation corpus extracted from OPUS (WMT Localisation). Lastly, we use the Translation Quality Estimation (TQE) dataset for Indian languages (Nisarg et al., 2018), which is essentially the concatenation of two corpora by the Indian Languages Corpora Initiative, focusing on the health and tourism domains respectively. For development, we use the first 999 sentences from the English-Gujarati version of newsdev2019. Further, we report results on the final newstest2019 corpus.
Pre-processing. All English corpora (excluding the evaluation corpora) undergo the same pre-processing. After sentence splitting, the corpora are normalized, tokenized and truecased using standard Moses scripts (Koehn et al., 2007a). A byte-pair encoding (BPE) (Sennrich et al., 2016b) of 40 k merge operations, trained jointly on the en-gu data, is then applied. Duplicates are removed and sentences with more than 50 tokens are discarded. In order to enable a multilingual setup, language tokens indicating the designated target language are prepended to each source sentence. As the English-Gujarati setting is bilingual, this reduces to each Gujarati sentence starting with the language token <en>, and each English sentence with <gu>.
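The last pre-processing steps can be sketched as follows. The function name is hypothetical and the sketch works on one side of the corpus only; how duplicates are detected across sentence pairs is not specified in the text and is left out here.

```python
def prepare_source(sentences, target_lang, max_tokens=50):
    """Deduplicate, length-filter and prepend the target-language token.

    sentences: tokenised (and BPE-encoded) source sentences.
    target_lang: "en" for Gujarati sources, "gu" for English sources.
    """
    seen = set()
    prepared = []
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence or sentence in seen:
            continue  # empty line or duplicate
        if len(sentence.split()) > max_tokens:
            continue  # more than 50 tokens
        seen.add(sentence)
        prepared.append(f"<{target_lang}> {sentence}")
    return prepared
```

For an English source corpus the call would be `prepare_source(english_sentences, "gu")`, so that every sentence starts with `<gu>`.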
Gujarati corpora are normalized and romanized using the Indic NLP Library. The romanized corpora are then tokenized using Moses. As the romanization is case sensitive, no truecasing is performed. The shared BPE is applied.
Cross-lingual word embeddings. We initialize the unsupervised NMT model using cross-lingual embeddings. These are trained using monolingual data only. For the English embeddings, we use ss-NewsCrawl, as well as the English crawled data. For Gujarati, all Gujarati data available in Table 3 is used. The initial monolingual embeddings (of size 512) are trained using word2vec. The two embeddings are then projected into a common multilingual space using vecmap (Artetxe et al., 2017). We extract all numerals that occur in both monolingual corpora in order to supply a small seed dictionary for training that is not linguistically motivated. After having projected the embeddings into the same space, they are merged into a single cross-lingual embedding. Whenever a word in the two languages is a homograph, one of the two is chosen randomly.
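The numeral seed dictionary and the final merge can be sketched as below. Both functions are illustrative assumptions: the numeral pattern (ASCII digits with optional separators) and the even chance of picking either vector for a homograph are not specified in the text.

```python
import random
import re

# Assumed definition of a numeral: ASCII digits, optionally with . or , inside.
NUMERAL = re.compile(r"^[0-9]+([.,][0-9]+)*$")


def numeral_seed_dictionary(vocab_en, vocab_gu):
    """Pair every numeral that occurs in both vocabularies with itself."""
    shared = set(vocab_en) & set(vocab_gu)
    return sorted((word, word) for word in shared if NUMERAL.match(word))


def merge_embeddings(emb_en, emb_gu, rng):
    """Merge two projected embedding tables into one.

    For homographs (words present in both tables) one of the two
    vectors is chosen at random.
    """
    merged = dict(emb_en)
    for word, vector in emb_gu.items():
        if word not in merged or rng.random() < 0.5:
            merged[word] = vector
    return merged
```

Passing an explicit `random.Random(seed)` instance as `rng` keeps the homograph choice reproducible between runs.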

Neural Machine Translation System
For training our models, we use both SMT and a transformer architecture. While the SMT is used to provide a first model for back-translations as well as to train the final model submitted, the transformer is used in-between to extract additional data from Wikipedia.
The transformer is trained using OpenNMT-py (Klein et al., 2017) and is defined as follows: 6-layer encoder-decoder with 8-head self-attention and 2048-dim hidden feed-forward layers. Adam optimization with λ=2 and β2=0.998; noam learning rate decay (as defined in Vaswani et al. (2017)) with 8000 warm-up steps. Labels are smoothed (ε=0.1) and a dropout mask (p=0.1) is applied. As is common for transformers, position encodings and Xavier parameter initialization (Glorot and Bengio, 2010) are used.
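For reference, the noam schedule of Vaswani et al. (2017) can be written as a short function. This is a sketch under two assumptions: λ=2 is read as the scale factor applied to the schedule, and the model dimension is taken to be 512, the size of the embeddings used for initialisation.

```python
def noam_learning_rate(step, scale=2.0, d_model=512, warmup_steps=8000):
    """Learning rate at a given training step under the noam schedule.

    The rate grows linearly during warm-up and then decays with the
    inverse square root of the step number.
    """
    if step < 1:
        raise ValueError("step must be a positive integer")
    return scale * d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```

Under these assumptions the rate peaks at step 8000, where the two arguments of `min` coincide, and decreases afterwards.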

Statistical Machine Translation System
The second family of systems we use in this setting is statistical machine translation (SMT). We expect these systems to perform better when the number of parallel sentences is small. SMT systems are trained using standard freely available software. We estimate a 5-gram or 4-gram language model using interpolated Kneser-Ney discounting with SRILM (Stolcke, 2002) depending on the language and the size of the monolingual corpus. Word alignment is done with GIZA++ (Och and Ney, 2003) and both phrase extraction and decoding are done with the Moses package (Koehn et al., 2007b). The optimisation of the feature weights of the model is done with Minimum Error Rate Training (MERT) (Och, 2003) against the BLEU (Papineni et al., 2002) evaluation metric. Our model considers the language model, direct and inverse phrase probabilities, direct and inverse lexical probabilities, phrase and word penalties, and a lexicalised reordering.

Results
We train our SMT and NMT systems in four steps, yielding the following models:
1. SMT base: Train an SMT model on the concatenation of all parallel training data listed in Table 3 (∼194 k pairs). This is then used to back-translate 4 k (2 k per language direction) pairs of the monolingual data available.
2. NMT extract: Initialize the Transformer with the pre-trained word embeddings. The transformer is used to extract additional data from the en-gu Wikipedias as well as the crawled Zeenews and News18 articles. It is also used to filter the back-translations produced by SMT base as well as the parallel corpus available. The extraction is performed using the joint NMT learning and extraction framework described in Ruiter et al. (2019). There, we use the margin-based function (Artetxe and Schwenk, 2018) for scoring both word embedding and hidden-state representations. This results in an extracted and filtered corpus of ∼275 k sentences; a slight increase over the original parallel data available to us despite the filtering of less useful pairs.
3. SMT extract: SMT model, trained on the corpus that resulted from the extraction and filtering performed by NMT extract.
4. SMT all: SMT model, trained on both the corpus extracted and filtered by NMT extract and the parallel data available, resulting in ∼475 k training pairs.
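The margin-based scoring used in step 2 can be illustrated with a small sketch of the ratio variant from Artetxe and Schwenk (2018). It operates on plain lists of floats and is not the extraction framework itself: the function names, the value of k and the brute-force neighbour search are assumptions, and all vectors are assumed to be non-zero.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def neighbourhood_term(vector, candidates, k):
    """Sum of cosines to the k nearest candidates, divided by 2k."""
    sims = sorted((cosine(vector, c) for c in candidates), reverse=True)[:k]
    return sum(sims) / (2 * len(sims))


def margin_score(x, y, sources, targets, k=4):
    """Ratio margin: cos(x, y) relative to the neighbourhoods of x and y.

    x is a source representation and y a target representation; the
    neighbours of x are searched among targets and those of y among sources.
    """
    denominator = neighbourhood_term(x, targets, k) + neighbourhood_term(y, sources, k)
    return cosine(x, y) / denominator
```

A pair that is much closer than the typical neighbours of either sentence scores above 1, which is what allows a threshold on the score to separate likely translations from merely similar sentences.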
Due to time constraints we could not apply any system combination technique on the individual systems. However, due to the big gap in performance between SMT and NMT we do not expect significant improvements. Table 4 shows translation quality as measured by BLEU for both the neural and statistical systems with the different data configurations.
The filtering and extraction performed by NMT extract led to a small increase in BLEU for SMT extract and SMT all, indicating that the filtering decisions were beneficial. However, when taking into account that the average number of extracted pairs from WP was steadily around 1.6 k pairs, and comparing them with the 18 k pairs in the en-gu WP reference, it becomes clear that extraction did not obtain high recall. This is most likely due to three difficulties that the system encounters in this setting: i) not enough comparable data was available to adapt the internal representations (word embeddings and hidden states) to the data, meaning that the extraction performance, which is bound to the extraction decisions of the representations, stays below its potential; ii) the lack of monolingual data to train high-quality gu embeddings and iii) the rareness of homographs in this rather distant language pair make the initialization difficult. Extraction in the first epochs is usually dependent on such homographs and a lack thereof reduces the number of identifiable pairs in the initialization phase of the model.

Conclusions
We presented two approaches for the WMT 2019 news translation shared task. We participated in the en2de task with a data-based coreference-aware NMT system. The corpus is enriched with this document-level information at sentence level so that the standard training procedure can be used. However, the amount of data we can use is smaller than in the standard pipeline and therefore the global quality can be damaged. We expect the manual evaluation to show improvements on the tackled phenomena such as gender translation.
For the en-gu task, we used an NMT architecture that can be trained on comparable corpora. In this case we downloaded news web pages as well as linked Wikipedia articles in Gujarati and English to extract and train on. Our experiments show that very few sentences could be used from this corpus and our results are close to the baseline one can get with the available parallel resources. Given the final amount of data, our state-of-the-art SMT system performed clearly better than our NMT one.