The TALP–UPC Spanish–English WMT Biomedical Task: Bilingual Embeddings and Char-based Neural Language Model Rescoring in a Phrase-based System

This paper describes the TALP–UPC system in the Spanish–English WMT 2016 biomedical shared task. Our system is a standard phrase-based system enhanced with vocabulary expansion using bilingual word embeddings and a character-based neural language model with rescoring. The former focuses on resolving out-of-vocabulary words, while the latter enhances the fluency of the system. The two modules progressively improve the final translation as measured by a combination of several lexical metrics.


Introduction
Machine Translation (MT) has been evolving in recent years, achieving successful translations as shown by international evaluations such as WMT 1 and by the increasing use of MT in commercial applications. However, specific domains such as the legal or biomedical ones still lag behind state-of-the-art MT systems. This can mostly be attributed to the lack of available corpora. The new biomedical task from WMT 2016 especially helps to improve our understanding in this direction.
In this paper, we describe our participation in the WMT 2016 biomedical task. We participated with a phrase-based SMT system enhanced with bilingual word embeddings and a character-based neural language model. Section 2 presents work related to our approach. Next, Section 3 introduces the theoretical aspects of the system components and Section 4 the experiments. Finally, we justify our choice for the final submission and draw conclusions in Section 5.

1 http://www.statmt.org/wmt16

Related Work
In this paper, we are interested in research that targets OOVs and in approaches to re-rank n-best lists of translations.
Our work closely follows Vulic and Moens (2015) and Zhao et al. (2015) in spirit, where word vectors are used to induce bilingual lexicons of words or phrases. We go a step further and build lexicons from bilingual word embeddings to be later used within an SMT system.
There is also a rich body of recent literature that focuses on obtaining bilingual word embeddings from aligned corpora (Bhattarai, 2012; Gouws et al., 2015; Kočiskỳ et al., 2014). We approach the problem differently: we obtain embeddings separately from monolingual corpora and then use supervision in the form of a small, sparse bilingual dictionary. This is similar to Mikolov et al. (2013b), who obtain monolingual embeddings for the two languages separately and then learn a transformation that projects the embeddings of source words onto the embeddings of their translations, using a big bilingual dictionary.
On the other hand, several language models have been used for rescoring in SMT. For example, neural feed-forward language models (Schwenk et al., 2006) have been used to rescore both n-gram-based and phrase-based systems. Mikolov (2012) re-ranks n-best lists with recurrent neural networks. Vaswani et al. (2013) combine feed-forward language models with rectified linear units and noise-contrastive estimation. Luong et al. (2015) propose deeper neural models, which improve re-ranking. In this paper, we use the character-based language model of Kim et al. (2016) to re-rank the output of the phrase-based system.

The Translation System
The TALP-UPC translation system is built on three different components. We describe their theoretical basis in the following subsections.

Phrase-based SMT
The standard phrase-based machine translation system (Koehn et al., 2003) focuses on finding the most probable target sentence given the source sentence. The phrase-based system has evolved from the noisy-channel to the log-linear model, which combines a set of feature functions in the decoder, including the translation and language models, the reordering model and the lexical models. Although the phrase-based system is a commoditized technology used at the academic and commercial level, there are still many challenges to solve, such as OOVs.
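As a toy illustration of the log-linear model, the decoder scores a candidate translation as a weighted sum of log feature values. The feature names and weights below are purely illustrative, not those of our system:

```python
import math

def loglinear_score(features, weights):
    """Score a candidate translation as a weighted sum of
    log feature values (translation model, language model, etc.)."""
    return sum(weights[name] * math.log(value)
               for name, value in features.items())

# Hypothetical feature values for one candidate translation.
candidate = {"tm": 0.02, "lm": 0.001, "reordering": 0.5, "word_penalty": 0.9}
weights = {"tm": 1.0, "lm": 0.8, "reordering": 0.3, "word_penalty": -0.2}

score = loglinear_score(candidate, weights)
```

The decoder searches for the target sentence maximising this score; the feature weights are the parameters later tuned with MERT.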

Vocabulary Expansion using Bilingual Word-Embeddings
We look at this task as a bilinear prediction task, as proposed by Madhyastha et al. (2014). The model makes use of word embeddings of both languages with no additional features. The probability of a target word t given a source word s is formulated as a log-linear model and takes the following form:

p(t|s) = exp(φt(t)^T W φs(s)) / Σ_t' exp(φt(t')^T W φs(s))   (1)

where φ(.) denotes the n-dimensional distributed representation of a word, and we assume we have both source (φs) and target (φt) embeddings.
Essentially, our problem reduces to: a) obtaining the word embeddings for the vocabularies of both languages from a significantly large monolingual corpus, and b) estimating W given a relatively small dictionary. To learn W, we use source-to-target word dictionaries as training supervision.
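The following sketch shows, on toy data, how W could be estimated from a small seed dictionary by gradient descent on the bilinear log-linear objective above. The dimensions, normalisation and training loop are illustrative, not our actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                      # toy embedding dimension (the paper uses 300)
n_src, n_tgt = 20, 20      # toy vocabulary sizes

# Hypothetical pre-trained monolingual embeddings (one row per word),
# length-normalised here for stable gradient steps.
phi_s = rng.normal(size=(n_src, d))
phi_t = rng.normal(size=(n_tgt, d))
phi_s /= np.linalg.norm(phi_s, axis=1, keepdims=True)
phi_t /= np.linalg.norm(phi_t, axis=1, keepdims=True)

# Small seed dictionary: (source index, target index) translation pairs.
dictionary = [(i, i) for i in range(10)]

# Maximise the log-likelihood of the dictionary under p(t|s) by
# full-batch gradient descent on W.
W = np.zeros((d, d))
lr = 0.1
for _ in range(500):
    grad = np.zeros_like(W)
    for s, t in dictionary:
        scores = phi_t @ W @ phi_s[s]      # score every target word
        p = np.exp(scores - scores.max())
        p /= p.sum()                       # softmax over the target vocabulary
        # Gradient of -log p(t|s) with respect to W.
        grad += np.outer(phi_t.T @ p - phi_t[t], phi_s[s])
    W -= lr * grad / len(dictionary)

# Translate a source word: pick the highest-scoring target word.
predictions = [int(np.argmax(phi_t @ W @ phi_s[s])) for s, _ in dictionary]
```

Since the scores are linear in W, the negative log-likelihood is convex and plain gradient descent suffices for this toy case.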

Character-based Neural Language Model
Language models based on Recurrent Neural Networks are currently one of the best performing approaches in terms of perplexity (Mikolov et al., 2010). They are also a good re-ranking option in tasks such as speech recognition and machine translation. However, the standard lookup-based word embeddings are limited to a finite-size vocabulary for both computational and sparsity reasons. Moreover, the orthographic representation of the words is completely ignored. The standard learning process is blind to the presence of stems, prefixes, suffixes and any other kind of affixes in words.
As a solution to those drawbacks, alternative character-based word embeddings have been recently proposed for tasks such as language modeling (Kim et al., 2016; Ling et al., 2015), parsing (Ballesteros et al., 2015) or part-of-speech tagging (Ling et al., 2015; Santos and Zadrozny, 2014). For our system, we selected the best character-based embedding architecture proposed by Kim et al. (2016). The computation of the representation of each word starts with a character-based embedding layer that associates each word (a sequence of characters) with a sequence of vectors. This sequence is then processed by a set of 1D convolution filters of different lengths (from 1 to 7 characters), followed by a max pooling layer and two highway layers. The output of the second highway layer provides the final vector representation of each source word, which replaces the standard source word embedding in the recurrent neural network used for language modeling (Kim et al., 2016).
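A minimal NumPy sketch of this character-based word representation (convolutions over character embeddings, max pooling over time, then highway layers) follows. The filter counts, character inventory and initialisations are toy values rather than those of the actual model, and the LSTM on top is omitted:

```python
import numpy as np

rng = np.random.default_rng(1)
char_dim = 15                            # toy character-embedding size
widths = list(range(1, 8))               # filter widths 1..7, as in Kim et al.
n_filters = {w: 25 * w for w in widths}  # toy filter counts (not the paper's)
d_word = sum(n_filters.values())         # dimension of the word vector

alphabet = "abcdefghijklmnopqrstuvwxyzáéíóúñ"
char_vocab = {c: i for i, c in enumerate(alphabet)}
char_emb = rng.normal(size=(len(alphabet), char_dim))
filters = {w: 0.1 * rng.normal(size=(n_filters[w], w, char_dim))
           for w in widths}

def highway(x, Wh, bh, Wt, bt):
    """Highway layer: a sigmoid gate mixes a ReLU transform with the identity."""
    gate = 1.0 / (1.0 + np.exp(-(Wt @ x + bt)))
    return gate * np.maximum(0.0, Wh @ x + bh) + (1.0 - gate) * x

def word_vector(word):
    """Character-CNN word representation: convolutions + max over time."""
    emb = char_emb[[char_vocab[c] for c in word]]      # (len(word), char_dim)
    if len(emb) < max(widths):                          # zero-pad short words
        emb = np.vstack([emb, np.zeros((max(widths) - len(emb), char_dim))])
    feats = []
    for w in widths:
        conv = np.array([np.tensordot(filters[w], emb[i:i + w],
                                      axes=([1, 2], [0, 1]))
                         for i in range(len(emb) - w + 1)])
        feats.append(np.tanh(conv.max(axis=0)))         # max pooling over time
    return np.concatenate(feats)                        # (d_word,)

# Two highway layers on top, as in the architecture described above.
hw = [(0.01 * rng.normal(size=(d_word, d_word)), np.zeros(d_word),
       0.01 * rng.normal(size=(d_word, d_word)), -2.0 * np.ones(d_word))
      for _ in range(2)]

vec = word_vector("perro")
for Wh, bh, Wt, bt in hw:
    vec = highway(vec, Wh, bh, Wt, bt)
```

In the real model this vector feeds the recurrent language model in place of a lookup-table word embedding, so unseen words still receive a representation from their characters.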

Data
Our main corpus is the compilation of the corpora assigned for the shared task, which was built using scientific publications gathered from the Scielo database. We focus on the Spanish-English language pair, for which the size of the corpora is summarised in Table 1. We further increase the vocabulary of the system by using standard parallel corpora for the Spanish-English language pair (e.g., the UN corpora, Europarl corpora, News corpus, etc. 2 ). This corpus appears as Quest in Table 1.
For the monolingual corpus we use an English and Spanish Wikipedia dump 3 .
The corpora have been pre-processed with a standard pipeline for both Spanish and English: tokenising and keeping parallel sentences of between 1 and 80 words. Additionally, for Spanish we used Freeling (Padró and Stanilovsky, 2012) to split clitic pronouns from verbs (e.g., comenzándose to comenzando + se); we also split contracted prepositions and articles, i.e., del to de + el and al to a + el. This was done for similarity with English.
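For illustration, the contraction-splitting step can be approximated with two regular expressions. This toy, lowercase-only version merely stands in for the Freeling-based pipeline actually used:

```python
import re

def split_spanish(sentence):
    """Toy sketch of the Spanish-side preprocessing described above:
    split contracted preposition+article forms (del -> de + el,
    al -> a + el). Lowercase-only; the real pipeline uses Freeling."""
    sentence = re.sub(r"\bdel\b", "de el", sentence)
    sentence = re.sub(r"\bal\b", "a el", sentence)
    return sentence
```

The word boundaries (`\b`) keep the rule from firing inside longer words such as "modelo".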
We divided the provided parallel corpus into training, development and test sets. Sentences for the development and test sets were taken randomly, proportionally to the amount of Medline and Scielo (biomedical and health) sources, and only from unique parallel sentences.
Since the domain of the test set is the same as the domain of training corpus, the number of OOV words is small. Table 2 shows the total number and percentage of unknown words in our in-house development and test sets with respect to translation tables (see the following section). For comparison, we also include the figures for the two test sets made available for the final evaluation.

System Description
As introduced in the previous section, our system is built from three different modules: the SMT engine, the module to resolve OOVs and the module for re-ranking.
SMT Engine. Three different state-of-the-art phrase-based SMT translation systems are trained on the parallel corpora detailed in Table 1. For the purely in-domain system, we use only the biomedical data made available for the task (STT systems, small translation table). For more general systems, we also use the Quest data; we name these systems BTT (big translation table).
For the in-domain system, a 5-gram language model is estimated on the target side of the corpus using interpolated Kneser-Ney discounting with SRILM (Stolcke, 2002) (SLM, small language model). For the extended systems, we use all the monolingual corpora available and the target side of the large parallel corpus (BLM, big language model). Word alignment is done with GIZA++ (Och and Ney, 2003) and both phrase extraction and decoding are done with the Moses package (Koehn et al., 2007). The weights of the model are optimised with MERT (Och, 2003) against the BLEU (Papineni et al., 2002) evaluation metric on devBio.
OOVs resolution. This module first obtains bilingual embeddings from the monolingual ones, as explained in Section 3.2. To estimate the monolingual word vector models, we use the CBOW algorithm as implemented in the Word2Vec package (Mikolov et al., 2013a) with a 5-token window. We obtain 300-dimensional vectors for English and Spanish from the monolingual corpora and the source side of the parallel corpora in Table 1. The bilingual counterpart has been estimated using 34,806 words from the Apertium bilingual dictionary 4 as a seed lexicon, divided into training and validation sets. Each bilingual pair has an associated probability given by Eq. 1. We keep the top-10 pairs for each out-of-vocabulary word in the test (development) set and include these new translation options at decoding time. Since we are only dealing with OOVs, the new options do not interact with the other phrase pairs in the translation table, but there is interaction with the language model.
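The candidate-selection step can be sketched as follows: score every target word for an OOV source word with the bilinear model of Eq. 1, normalise, and keep the top-k pairs. The function name and the toy basis-vector embeddings are illustrative only:

```python
import numpy as np

def oov_candidates(src_vec, tgt_emb, tgt_vocab, W, k=10):
    """Top-k target translation options for one OOV source word,
    scored with the bilinear model p(t|s) ∝ exp(phi_t(t)^T W phi_s(s))."""
    scores = tgt_emb @ W @ src_vec
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                   # normalise over target words
    top = np.argsort(-probs)[:k]
    return [(tgt_vocab[i], float(probs[i])) for i in top]

# Toy example: basis-vector embeddings and an identity W (illustrative only).
tgt_emb = np.eye(6)
tgt_vocab = ["t%d" % i for i in range(6)]
pairs = oov_candidates(tgt_emb[2], tgt_emb, tgt_vocab, np.eye(6), k=3)
```

Each returned pair would enter the decoder as an extra translation option for the OOV word, with its probability as the feature value.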
Re-ranking. The 1000-best list of translations given by the SMT engine is re-ranked using the character-based language model described in Section 3.3. It has 1D convolutional filters of widths [1,2,3,4,5,6,7] and sizes [50,100,150,200,200,200,200], for a total of 1,100 filters with a tanh activation, 2 highway layers with a ReLU activation, and 2 LSTM layers with 650 hidden units. The network has been trained on the monolingual part of the in-domain data (Biomedical corpus in Table 1).
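Conceptually, the re-ranking step interpolates each hypothesis's SMT model score with a language-model score. The sketch below uses a stand-in scorer and an illustrative interpolation weight rather than the actual char-based network:

```python
def rerank(nbest, lm_logprob, weight=0.5):
    """Re-rank an n-best list: combine each hypothesis's SMT model score
    with a language-model log-probability. `lm_logprob` maps a hypothesis
    string to a score; `weight` is illustrative (it would be tuned)."""
    rescored = [(hyp, smt + weight * lm_logprob(hyp)) for hyp, smt in nbest]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)

# Toy usage with a stand-in "LM" that rewards a plausible bigram.
nbest = [("la casa verde", -0.9), ("la verde casa", -0.8)]
toy_lm = lambda hyp: 0.0 if "casa verde" in hyp else -1.0
best = rerank(nbest, toy_lm)[0][0]
```

Here the stand-in LM overturns the SMT ranking, which is exactly the effect a fluency-oriented rescorer is meant to have.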

Results
We evaluate the performance of each module when added to the three standard SMT systems built with different amounts of training data (STTSLM, STTBLM, BTTBLM). In the following, we denote the module for OOV resolution with oov and the module for re-ranking with reranked. For the total.reranked system, we re-ranked the n-best lists of the thirteen systems with our neural language model. We conduct the evaluation automatically with a set of lexical metrics calculated with the Asiya toolkit 5 (Giménez and Màrquez, 2010). Table 3 reports the results for the English-to-Spanish translation systems and Table 4 for the Spanish-to-English ones.
The first thing to notice is that the best translation is obtained when only in-domain data are used to build the translation model. This holds in both directions. When going from Spanish into English, we obtain 0.45 BLEU points of improvement when adding the oov module to the in-domain system (STTSLM.oov) and an additional 0.15 points with the re-ranking module (STTSLM.oov.reranked). Even though the number of OOVs is only 0.09% in this test set, the improvement with this module is consistent across all metrics. The main reason is that making new translation options available at decoding time allows the language model to modify the sentence as a whole, so the neighbouring words can change accordingly.
In the English-to-Spanish direction, the trends are less homogeneous across the set of metrics. For BLEU and METEOR (with the stemming variant, MTRst), the best system is still STTSLM.oov. However, with NIST and TER, the best system is STTBLM. In this case, enlarging the language model has a similar effect to injecting new vocabulary through OOV translations. This is because only 31% of the OOVs belong to the biomedical domain, suggesting that in this case, and for an in-domain test set, it is important to gain fluency on general-domain phrases. The effect of the re-ranking module is more evident in this direction: the more data one uses, the more distinct the final n-best list is and the more improvement one can obtain. For the in-domain system the re-ranking does not promote a better translation, but for the general system the improvement is significant.

5 http://nlp.cs.upc.edu/asiya

Conclusions
We have built thirteen translation systems per direction. The ones chosen for the final submission follow two criteria: i) they have a top performance according to BLEU and METEOR (the official metrics) and ii) they allow a coherent comparison among languages and methodologies. With these criteria, our primary submission for both the health and biological test sets is the strictly in-domain system with the OOV module (STTSLM.oov). For comparison, we also submitted our baseline as a second run: the same system without the OOV module (STTSLM). Finally, we submitted as a third run a system with re-ranking of a 1000-best list. Due to time constraints, we could not submit the system that re-ranks the n-best lists of all thirteen systems, total.reranked, but we used instead the two most promising options per direction.
According to the preliminary results of the shared task, the OOV module consistently improves the translations with respect to our baseline, especially in the health subdomain as measured by BLEU. The effect is similar to the results on our in-house test set. On the other hand, the re-ranking module is also always better than the in-domain phrase-based baseline and, in this case, the performance on the competition test set is significantly better than that on our test set, especially for English-to-Spanish. Run 3, the system that includes re-ranking with a char-based neural language model, is 2 BLEU points above the average among participants in the biological subdomain and 1 BLEU point above in the health subdomain.