SMT and Hybrid systems of the QTLeap project in the WMT16 IT-task

This paper presents the description of the 12 systems submitted to the WMT16 IT-task, covering six different languages, namely Basque, Bulgarian, Dutch, Czech, Portuguese and Spanish. All these systems were developed under the scope of the QTLeap project and share a common strategy. For each language two different systems were submitted: a phrase-based MT system built using Moses, and a system exploiting deep language engineering approaches, which for all languages but Bulgarian was implemented using TectoMT. For Bulgarian, the second system instead exploits deep factors in Moses.


Introduction
The QTLeap 1 project focuses on the development of an articulated methodology for machine translation that explores deep language engineering approaches and sophisticated semantic datasets. The underlying hypothesis is that the deeper the level of representation, the better the translation becomes, since deeper representations abstract away from surface aspects that are specific to a given language. At the limit, the representation of the meaning of a sentence, and of all its paraphrases, would be shared among all languages. This purpose is supported by recent advances in lexical processing. These advances have been made possible by enhanced techniques for referential and conceptual ambiguity resolution, supported also by new types of datasets recently developed as linked open data.
The overall goal of the project is to produce quality translation between English (EN) and another language X by using deep linguistic information. All language pairs follow the same processing pipeline of analysis, transfer and synthesis (generation) and adopt the same hybrid MT approach of using both statistical and rule-based components in a tightly integrated way for the best possible results.
In this paper, we present the systems developed by the University of the Basque Country for Basque and Spanish, by Charles University in Prague for Czech, by the University of Groningen for Dutch, by the University of Lisbon for Portuguese, and by IICT-BAS of the Bulgarian Academy of Sciences for Bulgarian.
For each language two different systems were submitted, corresponding to different phases of the project: a phrase-based MT system built using Moses (Koehn et al., 2007), and a system exploiting deep language engineering approaches, which for all languages but Bulgarian was implemented using TectoMT (Žabokrtský and Popel, 2009). For Bulgarian, the second MT system is not based on TectoMT but on exploiting deep factors in Moses. All 12 systems are constrained, that is, trained only on the data provided by the WMT16 IT-task organizers.
We briefly present the common Moses setting and the TectoMT structure, and then more detailed information is provided for each language system. In the last section, results based on BLEU and TrueSkill are given and discussed.

Moses
All the submitted systems based on Moses were trained on a phrase-based model using Giza++ or MGIZA with "grow-diag-final-and" symmetrization and "msd-bidirectional-fe" reordering (Koehn et al., 2003). For the language pairs where large quantities of domain-specific monolingual data were available along with the generic-domain data, separate language models (domain-specific and generic) were interpolated against our ICT domain-specific development set. For LM training and interpolation, the SRILM toolkit (Stolcke, 2002) was used. Truecasing was adopted for several language pairs where it proved useful.
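The interpolation step can be illustrated with a minimal sketch (a hedged illustration: toy unigram models and a simple grid search stand in for SRILM's mixture tuning, and all names and data below are hypothetical). The interpolation weight is chosen to minimize perplexity on the in-domain development set.

```python
import math

def perplexity(domain_lm, weight, generic_lm, dev_tokens):
    """Perplexity of a linear interpolation of two unigram models."""
    log_sum = 0.0
    for tok in dev_tokens:
        p = weight * domain_lm.get(tok, 1e-9) + (1 - weight) * generic_lm.get(tok, 1e-9)
        log_sum += math.log(p)
    return math.exp(-log_sum / len(dev_tokens))

def best_mix(domain_lm, generic_lm, dev_tokens, steps=101):
    """Grid-search the interpolation weight that minimizes dev-set perplexity."""
    grid = [i / (steps - 1) for i in range(steps)]
    return min(grid, key=lambda w: perplexity(domain_lm, w, generic_lm, dev_tokens))

# Toy models: the domain LM matches the IT-like dev set better,
# so it receives the larger weight.
domain_lm = {"click": 0.4, "menu": 0.4, "the": 0.2}
generic_lm = {"the": 0.6, "click": 0.05, "menu": 0.05, "house": 0.3}
dev = ["click", "the", "menu", "click"]
w = best_mix(domain_lm, generic_lm, dev)
```

In practice the same idea is applied to n-gram models over the full development set; the grid search here only makes the weight-selection criterion explicit.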

TectoMT
The deep translation is based on the TectoMT system, an open-source MT system based on the Treex platform for general natural-language processing. TectoMT uses a combination of rule-based and statistical (trained) modules (blocks in Treex terminology), with a statistical transfer based on HMTM (Hidden Markov Tree Model) at the level of a deep, so-called tectogrammatical, representation of sentence structure. The general TectoMT pipeline is language independent, and consists of analysis, deep transfer, and synthesis steps.
The design of TectoMT is highly modular: it consists of a language-universal core plus language-specific additions, and distinguishes two levels of syntactic description:
• Surface dependency syntax (a-layer) - surface dependency trees containing all the tokens in the sentence.
• Deep syntax (t-layer) - tectogrammatical trees whose nodes correspond to content words, each labeled with a t-lemma, a formeme and grammatemes.

T-layer representations of the same sentence in different languages are closer to each other than the surface texts; in many cases, there is a 1:1 node correspondence among the t-layer trees. TectoMT's transfer exploits this by translating the tree isomorphically, i.e., node by node, assuming that the shape will not change in most cases (apart from a few exceptions handled by specific rules).
The translation is further factorized: t-lemmas, formemes, and grammatemes are translated using separate Translation Models (TM). The t-lemma and formeme TMs are an interpolation of maximum entropy discriminative models (MaxEnt) (Mareček et al., 2010) and simple conditional probability models. The MaxEnt models are in fact an ensemble of models, one for each individual source t-lemma/formeme. The combined translation models provide several translation options for each node along with their estimated probability. The best options are then selected using a Hidden Markov Tree Model (HMTM) with a target-language tree model (Žabokrtský and Popel, 2009).
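The selection step can be sketched as a tree-structured Viterbi search (a simplified illustration, not the Treex implementation; the data structures, function names and toy Czech lemmas below are all hypothetical). Each node carries weighted translation options, and a parent-child compatibility function plays the role of the target-language tree model.

```python
import math

def best_labeling(tree, options, edge_score):
    """Tree Viterbi: pick one option per node, maximizing the sum of
    option log-probs plus parent-child compatibility scores.
    `tree` maps node -> children (root is node 0);
    `options` maps node -> {label: log_prob};
    `edge_score(parent_label, child_label)` -> log compatibility."""
    choice = {}

    def score(node, label):
        total = options[node][label]
        for child in tree.get(node, []):
            best = max(options[child],
                       key=lambda c: edge_score(label, c) + score(child, c))
            choice[(node, label, child)] = best
            total += edge_score(label, best) + score(child, best)
        return total

    root_best = max(options[0], key=lambda l: score(0, l))
    # read out the chosen labels top-down
    result, stack = {0: root_best}, [0]
    while stack:
        node = stack.pop()
        for child in tree.get(node, []):
            result[child] = choice[(node, result[node], child)]
            stack.append(child)
    return result

# Toy example: "file" under "open"; the tree model prefers the
# lemma pair that is compatible in the target language.
tree = {0: [1]}
options = {0: {"soubor": math.log(0.6), "pilnik": math.log(0.4)},
           1: {"otevrit": math.log(0.9), "zahajit": math.log(0.1)}}
compat = {("soubor", "otevrit"): 0.0, ("pilnik", "otevrit"): -3.0,
          ("soubor", "zahajit"): -1.0, ("pilnik", "zahajit"): -1.0}
labels = best_labeling(tree, options, lambda p, c: compat[(p, c)])
```

The real HMTM decoder operates on full tectogrammatical trees with many options per node; this sketch only shows why tree-shaped dynamic programming makes the joint selection tractable.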
For this specific task, which targets a specific domain, an extended version of TectoMT was used that allows the interpolation of multiple TMs (Rosa et al., 2015).


Basque
Both Basque systems were trained on the same corpora: the in-domain Batch1 corpus was used for domain adaptation and MERT training, while the Batch2 domain corpus was used for testing during development.
The Moses system, EU-Moses, uses factored models to allow lemma-based word alignment. After word alignment, the rest of the training process is based on lowercased word forms and standard parameters: Stanford CoreNLP and Eustagger (Alegria et al., 2002) are used for tokenization and lemmatization, and MGIZA for word alignment with the "grow-diag-final-and" symmetrization heuristic, with a maximum length of 75 tokens per sentence and 5 tokens per phrase, translation probabilities in both directions, lexical weightings in both directions, a phrase length penalty, a "phrase-mslr-fe" lexicalized reordering model and a target language model. As for the language model, a 5-gram model was trained. The weights for the different components were adjusted to optimize BLEU using MERT tuning over the Batch1 development set, with an n-best list of size 100.
For the TectoMT system, EU-Treex, existing tools were used to obtain the a-layer. Eustagger is a robust and wide-coverage morphological analyzer and POS tagger. The dependency parser is based on the MATE tools (Björkelund et al., 2010). Basque models have been trained using the Basque Dependency Treebank (BDT) corpus (Aduriz et al., 2003). The transformation from the a-level analysis into the t-level is partially performed with language-independent blocks thanks to the support of Interset (Zeman, 2008).
The English-to-Basque TectoMT system uses the PaCo2 and the Batch1 corpora to train two separate translation models, which are used to create an interpolated list of translation candidates. In addition, the terminological equivalences extracted from the localization PO files (VLC, LO and KDE), as well as the domain terms extracted from Wikipedia, are used to identify domain terms before syntactic analysis and to ensure domain-consistent translation during transfer. Finally, an extra module to treat non-linguistic elements (URLs, shell commands, ...) is used to identify the elements that should be left untranslated in the output.
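Such a module can be approximated by a mask-and-restore step around the translator (a hedged sketch; the regular expression and placeholder scheme below are illustrative, not the project's actual rules):

```python
import re

# Patterns for elements that must survive translation untouched.
# Matching URLs and backtick-quoted shell commands is an assumption
# for illustration only.
NONLING = re.compile(r"(https?://\S+|`[^`]+`)")

def mask(sentence):
    """Replace non-linguistic spans with placeholders; return the masked
    text and the spans needed to restore them after translation."""
    spans = NONLING.findall(sentence)
    masked = sentence
    for i, span in enumerate(spans):
        masked = masked.replace(span, f"XNONLINGX{i}", 1)
    return masked, spans

def unmask(translation, spans):
    """Put the original spans back in place of the placeholders."""
    for i, span in enumerate(spans):
        translation = translation.replace(f"XNONLINGX{i}", span, 1)
    return translation

masked, spans = mask("Run `sudo apt-get update` or visit https://example.org now")
restored = unmask(masked, spans)
```

The placeholders pass through analysis and transfer as opaque tokens, so the protected material cannot be mangled by the translation models.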

Bulgarian
The Bulgarian team participated with two systems implemented using Moses: BG-Moses, a system based on standard factored Moses with factors retrieved from POS-tagged, lemmatized parallel corpora; and BG-DeepMoses, a system also based on standard factored Moses, but where the translation is done in two steps: (1) a semantics-based translation of the source-language text into a mixed source-target-language text, which is then (2) translated into the target language via Moses. The latter system builds on Simov et al. (2015).
As training data for both systems the following corpora were used: the Setimes parallel corpus, the Europarl parallel corpus and a corpus created on the basis of the documentation of LibreOffice. The corpora are linguistically processed with the IXA 2 pipeline for the English part and with the BTB pipeline for the Bulgarian part. The analyses include POS tagging, lemmatization and WSD, using the UKB system, 3 which provides graph-based methods for Word Sense Disambiguation and lexical similarity measurements.
For the BG-Moses system, the following factors have been constructed: WordForm|Lemma|POStag.
For the BG-DeepMoses system, we exploited also the information from word sense annotation in order to predict some translations from English to Bulgarian based on the WordNet synsets and their mappings to the Bulgarian WordNet. Thus, we replaced the English word form with a representative lemma in Bulgarian. The motivation for using representative lemmas in Bulgarian is as follows: we aim at unifying the various synsets with similar translations in the Bulgarian language. After the creation of this intermediate English/Bulgarian text, we trained Moses with the following factors: ENWordForm-BGLemma|Lemma|BGPOStag, where ENWordForm-BGLemma is an English word form when there is no appropriate Bulgarian one, or the Bulgarian lemma; BGPOStag is the appropriate Bulgarian tag representing grammatical features like number, tense, etc.
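The construction of the intermediate factored text can be sketched as follows (a hedged illustration: the synset-to-lemma table is invented, and English POS tags stand in for the Bulgarian BGPOStag factor described above):

```python
# Toy synset-to-representative-Bulgarian-lemma mapping (hypothetical
# entries, not the actual WordNet mapping used by the system).
SYNSET_TO_BG = {"file.n.01": "файл", "open.v.01": "отварям"}

def to_factored(tokens):
    """Build Surface|Lemma|POStag factored tokens: the surface factor is
    the representative Bulgarian lemma when the word's synset maps to
    one, otherwise the original English word form."""
    out = []
    for form, lemma, pos, synset in tokens:
        surface = SYNSET_TO_BG.get(synset, form)
        out.append(f"{surface}|{lemma}|{pos}")
    return " ".join(out)

line = to_factored([("Open", "open", "VB", "open.v.01"),
                    ("the", "the", "DT", None),
                    ("file", "file", "NN", "file.n.01")])
```

Moses is then trained on such mixed-language factored text, so content words already carry their target-language lemma into the second translation step.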

Czech
The Czech Moses system follows the CU-Bojar system (Bojar et al., 2013). A factored phrase-based model was trained on truecased forms translated directly to the pair <truecased form, morphological tag>. There were three LMs for Czech:
• 8-grams of morphological tags from the monolingual part of news and political corpora,
• 6-grams of forms from the monolingual part of news and political corpora, and
• 6-grams from the Czech side of the bilingual Czech-English corpus CzEng.
The pre-processing of this SMT system has been harmonized with the pre-existing version of TectoMT: tokenization and lemmatization are handled by Treex, followed by further tokenization at any letter-digit-punctuation boundary. Additionally, casing is handled by a Czech-specific supervised truecasing method. The output of the lemmatizer is used: since names have their lemmas capitalized, the casing of the lemma is cast to the token (lowercasing non-names at sentence beginnings, and lowercasing also ALL CAPS tokens if they are correctly lemmatized). The translation is then done using case-sensitive tokens, and finally only the first letter in every sentence is capitalized.
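The lemma-based casing decision can be sketched as follows (a simplified illustration of the idea, not the actual Treex code; the sentence-initial recapitalization applied after translation is omitted):

```python
def truecase(pairs):
    """Cast lemma casing onto tokens: a capitalized lemma marks a name,
    so the token is (re)capitalized; a lowercase lemma marks a non-name
    (e.g. a sentence-initial word or a correctly lemmatized ALL CAPS
    token), so the token is lowercased."""
    return [token.capitalize() if lemma[0].isupper() else token.lower()
            for token, lemma in pairs]

# Hypothetical (token, lemma) pairs from a lemmatizer:
tokens = truecase([("The", "the"), ("PRAGUE", "Praha"), ("OFFICE", "office")])
```

This keeps names distinguishable during translation while removing spurious casing variation from the training data.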
The TectoMT analysis pipeline is based on the annotation pipeline of the CzEng 1.0 corpus (Bojar et al., 2012), starting with a rule-based tokenizer, a statistical part-of-speech tagger (Straková et al., 2014) and a dependency parser (McDonald et al., 2005; Novák and Žabokrtský, 2007). These steps result in a-layer trees, which are then converted to the t-layer using a rule-based process.
The English-to-Czech transfer uses a combination of translation models and tree model reranking. The Czech synthesis pipeline has remained basically unchanged since the original TectoMT system (Žabokrtský et al., 2008).

Dutch
The Moses system for Dutch was trained on the third version of the Europarl corpus (Koehn, 2005) and the in-domain KDE4 localization data (Tiedemann, 2012). Words were aligned with GIZA++ and tuning was done with MERT. The Dutch baselines used "grow-diag-final-and" alignment symmetrization and "msd-bidirectional-fe" reordering. IRSTLM was used to train a 5-gram language model with Kneser-Ney smoothing on the monolingual part of the training corpora.
For the TectoMT system, the analysis of Dutch input uses the Alpino system (van Noord, 2006), a stochastic attribute-value grammar. The transfer uses discriminative (context-sensitive) and dictionary translation models. In addition, a few rule-based modules are employed that handle changes in t-tree topology and Dutch grammatical gender.
The Dutch synthesis pipeline includes morphology initialization and agreement (subject-predicate and attribute-noun), insertion of prepositions and conjunctions based on formemes, and insertion of punctuation, possessive pronouns and Dutch pronominal adverbs. The t-tree resulting from the transfer phase is first converted into an Abstract Dependency Tree (ADT) using rule-based modules implemented in Treex. The ADT is then passed to the Alpino generator (de Kok and van Noord, 2010), which handles the creation of the actual sentence, including inflected word forms.

Spanish
The Moses system developed for the translation from English to Spanish, ES-Moses, uses standard parameters: tokenization and truecasing using the tools available in the Moses toolkit, MGIZA for word alignment with the "grow-diag-final-and" symmetrization heuristic, a maximum length of 80 tokens per sentence and 5 tokens per phrase, translation probabilities in both directions with Good-Turing discounting, lexical weightings in both directions, a phrase length penalty, an "msd-bidirectional-fe" lexicalized reordering model and a 5-gram target language model. The weights for the different components were adjusted to optimize BLEU using MERT tuning over the Batch1 development set, with an n-best list of size 100.
The English-to-Spanish TectoMT system, ES-Treex, uses the Europarl and the Batch1 corpora to train two separate translation models, and these were used to create an interpolated list of translation candidates. In addition, the terminological equivalences extracted from the localization PO files (VLC, LO and KDE), as well as the domain terms extracted from Wikipedia, are used to identify domain terms before syntactic analysis and to ensure domain-consistent translation during transfer. Finally, an extra module to treat non-linguistic elements (URLs, shell commands, ...) is used to identify the elements that should be left untranslated in the output.
Both systems were trained using the same training corpora: the 7th version of the Europarl corpus was used for both translation and language modeling, and the in-domain Batch1 corpus was used for domain adaptation and MERT training. The Batch2 domain-specific corpus was used for testing during development. We have not used all the available parallel corpora, due to the computational cost of analyzing all those corpora at the tectogrammatical level of the TectoMT system.

Portuguese
The Moses system for the translation from English to Portuguese, PT-Moses, was obtained using the default parameters and tools for training a phrase-based model. For the pre-processing, a maximum sentence length of 80 words was used and the tokenization was performed by the Moses tokenizer. No lemmatization or compound splitting was used, and the casing was obtained with the Moses truecaser. For the training, a phrase-based model was used with a 5-gram language model with Kneser-Ney smoothing, interpolated using the SRILM tool. The word alignment was done with Giza++ on full forms and the final tuning was done using MERT. The Europarl corpus was used as training data, both as monolingual data for training language models and as parallel data for training the phrase table.
Regarding the English-to-Portuguese TectoMT system (Silva et al., 2015; Rodrigues et al., 2016a), PT-Treex, in order to obtain the a-layer the Portuguese system resorted to the LX-Suite (Branco and Silva, 2006), a set of pre-existing shallow processing tools for Portuguese that includes a sentence segmenter, a tokenizer, a POS tagger, a morphological analyzer and a dependency parser, all with state-of-the-art performance. Treex blocks were created to call and interface with these tools.
After running the shallow processing tools, the dependency output of the parser is converted into Universal Dependencies (UD) (de Marneffe et al., 2014). These dependencies are then converted into the a-layer tree (a-tree) in a second step. Both steps are implemented as rule-based Treex blocks. Converting the a-tree into a t-layer tree (t-tree) is done through rule-based Treex blocks that manipulate the tree structure.
The transfer phase is handled by a tree-to-tree maximum entropy translation model (Mareček et al., 2010) working at the deep syntactic level of tectogrammatical trees. Two separate models were trained and interpolated, the first model with over 1.9 million sentences from Europarl (Koehn, 2005) and the second model composed of the Batch1, the Microsoft Terminology Collection and the LibreOffice localization data (Štajner et al., 2016). Each pair of parallel sentences, one in English and one in Portuguese, are analyzed by Treex up to the t-layer level, where each pair of trees are fed into the model.
The TectoMT synthesis (Rodrigues et al., 2016b) included two other lexical-semantics-related modules, HideIT and the gazetteers. The HideIT module handles entities that do not require translation, such as URLs and shell commands. The gazetteers are specialized lexicons that handle the translation of named entities from the IT domain, such as menu items and button names.
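The gazetteer step can be approximated by longest-match-first substitution (a hedged sketch with invented entries; the real module operates on identified named entities rather than raw strings):

```python
# Toy gazetteer of IT-domain entity translations (hypothetical entries,
# not the actual lexicon used by PT-Treex).
GAZETTEER = {"Save As": "Guardar Como", "File": "Ficheiro"}

def translate_entities(sentence):
    """Replace gazetteer entries longest-first, so that multi-word menu
    items are handled before any of their single-word substrings."""
    for entity in sorted(GAZETTEER, key=len, reverse=True):
        sentence = sentence.replace(entity, GAZETTEER[entity])
    return sentence

out = translate_entities("Click File and then Save As")
```

Ordering by length matters: translating "File" first would corrupt the multi-word item "Save As File"-style entries before they can match.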
Finally, synset IDs were used as additional contextual features in the lemma-to-lemma Discriminative Translation Models (Neale et al., 2016).


Results
Table 1 presents the results of the automatic and manual evaluation, based on BLEU and TrueSkill 4 scores respectively. For 4 of the 6 languages, the TectoMT-based system performs better than the Moses-based one when considering both BLEU and TrueSkill scores. For Bulgarian, BG-DeepMoses performs worse than BG-Moses on both scores. For Dutch, the Moses system outperforms TectoMT only when considering the BLEU score, but not the TrueSkill score.

Regarding Bulgarian, although the BG-DeepMoses system performed worse than BG-Moses, the automatic conversion of the source text into near-target-language text represents a promising direction for further improvement of the English-to-Bulgarian MT system. We assume that the current drop can be overcome by improving the WordNet information for Bulgarian, its mapping to the English WordNet, as well as the processing pipelines. We also plan to train this system on more data and to exploit other bilingual dictionaries.
For the English→Dutch translation direction, the Moses system outperforms TectoMT in terms of BLEU score. The results of the manual evaluation, however, are in favor of the TectoMT system. This difference may in part be caused by the fact that BLEU only scores exact word or phrase matches, while the TectoMT output shows more lexical flexibility compared to Moses. We get better results, in terms of BLEU score, in the opposite translation direction, which indicates that more effort should be put into the English→Dutch direction. Our focus here lies on the Dutch synthesis pipeline, where we still need to fix some basic errors. We also intend to implement more modules based on lexical semantics. In addition, we submitted to the IT-task a third system for Czech, Dutch, Spanish and Portuguese, called Chimera, which combines Moses and TectoMT (Rosa et al., 2016).