Why Catalan-Spanish Neural Machine Translation? Analysis, comparison and combination with standard Rule and Phrase-based technologies

Catalan and Spanish are closely related languages, since both derive from Latin. They share similarities at several linguistic levels, including morphology, syntax and semantics, which makes them particularly interesting for the MT task. Given the recent appearance and popularity of neural MT, this paper analyzes the performance of this new approach compared to the well-established rule-based and phrase-based MT systems. Experiments are reported on a large database of 180 million words. Results, in terms of standard automatic measures, show that neural MT clearly outperforms the rule-based and phrase-based MT systems on the in-domain test set, but it is worse on the out-of-domain test set. A naive system combination works especially well for the latter. In-domain manual analysis shows that neural MT tends to improve both adequacy and fluency, for example, by generating more natural translations instead of literal ones, choosing the adequate target word when the source word has several translations, and improving gender agreement. However, out-of-domain manual analysis shows that neural MT is more affected by unknown words or contexts.


Introduction
Machine Translation (MT) is the application that automatically translates from a source language into a target language. Approaches vary from rule-based to corpus-based. Rule-based MT systems were the first MT systems to be widely commercialized (Arnold et al., 1994). Years later, corpus-based approaches attracted the interest of both the scientific and the industrial communities (Hutchins, 1986). Recently, the neural MT approach has been proposed. This corpus-based approach uses deep learning techniques (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014) and it may be taking over previously popular corpus-based approaches such as statistical phrase-based or hierarchical-based MT (Koehn et al., 2003; Chiang, 2007). As a result, large companies such as Google, which first used rule-based MT and then statistical MT, have very recently started replacing some of their statistical MT engines with neural MT engines (Wu et al., 2016). This paper analyzes how standard neural MT techniques, which are briefly described in section 4.3, perform on the Catalan-Spanish task compared to popular rule-based and phrase-based MT. Additionally, we perform a naive system combination using the standard Minimum Bayes Risk (MBR) technique (Ehling et al., 2007), which yields slight improvements, in terms of standard automatic measures, on the in-domain test set but large improvements on the out-of-domain test set.
Catalan and Spanish are closely related languages, which makes them particularly interesting for MT, and translation performance is already quite high for rule-based and statistical systems.
Given these similarities, we want to test how neural MT behaves on such a related language pair. This leads us to the main question that this paper tries to answer: Is neural MT competitive with current well-performing rule-based and phrase-based MT systems?
The answer to this question will be especially useful to industry, which may decide, based on the results shown here, whether it is worth changing its current paradigm, be it rule-based, statistical or a combination of both. The aim of this study is to compare these systems in terms of translation quality; efficiency and computational cost are outside the scope of this paper.
In this sense, the main contribution of this paper is the analysis and discussion of how the new neural MT approach addresses Catalan-Spanish MT compared to state-of-the-art systems, and of what the remaining challenges are for this particular language pair.
The rest of this paper is structured as follows. The next section briefly reports on the related work. Section 3 analyses details of this language pair. Section 4 briefly describes each MT approach: rule, phrase and neural-based, respectively. Section 5 details the experimental framework both in data description and in system parameters. Section 6 compares systems based on both automatic and manual analysis and discusses results. Finally, Section 7 reports the main conclusions of this paper.

Related work
Previous related publications on the Catalan-Spanish language pair cover rule-based MT (Canals-Marote et al., 2001; Alonso, 2005) and statistical MT (Poch et al., 2009; Costa-jussà et al., 2012). It is worth noting that, given the similarity between Catalan and Spanish, Vilar et al. (2007) proposed building a statistical MT system that translates letters, an idea similar to recent character-based approaches in neural MT (Costa-jussà and Fonollosa, 2016). To the best of our knowledge, there is no previous work on neural MT covering the Catalan-Spanish language pair.

Catalan and Spanish languages
This section reviews several aspects of the language pair we are addressing as a motivation of our study. We point out several social aspects covering language speakers and countries as well as commenting on situations of bilingualism. We also report linguistic aspects of both languages.

Social aspects
There are around 470 million native speakers of Spanish compared to 4 million of Catalan (as claimed in Wikipedia). As a consequence, resources for Spanish are much larger than resources for Catalan. Catalan-Spanish bilingualism only occurs in some regions of Spain and in Andorra. The tendency is that all Catalan native speakers, in practice, also speak Spanish. However, the same is not true of Spanish native speakers. This leads us to a first use case for an MT system for this language pair: Spanish (native) speakers who do not understand Catalan. Other use cases include professional translation and web page translation.

Linguistic aspects
Catalan and Spanish belong to the Romance languages, the modern languages that evolved from Latin. Since both languages belong to the same linguistic family, they share linguistic features such as morphological inflection and word order. Translation between the two is quite straightforward, since there are very few word reorderings and both vocabulary size and morphological inflection are quite similar.

MT Approaches
This section briefly describes standard baseline architectures for rule-based, phrase-based and neural-based MT. All systems are described in a generic way; the particular details of each system used in this work are given later in section 5.2. It is worth mentioning that the rule-based system differs significantly from the other two because it is not corpus-based. The phrase-based and neural-based systems, although both corpus-based, handle data very differently: the phrase-based system uses frequency counts, whereas the neural-based system uses nonlinear transformations. The main advantage of corpus-based approaches over the rule-based one is that they learn from data, while the main advantage of the neural-based approach over the phrase-based one is that its architecture allows for end-to-end optimization.

Rule-based MT
Rule-based MT combines dictionaries and hand-written rules to generate the target output given the source input. Generally, a morphological and syntactic analysis of the source input is needed before transferring it into a simplified target representation. The final target is generated by adding the appropriate morphology and/or syntax. See Figure 2 for a schematic representation of this approach.
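The analysis, transfer and generation steps described above can be illustrated with a toy example. The following is only a minimal sketch and not the implementation of any real rule-based system: the lexicon entries, the `DET_FORMS` table and the `translate` function are hypothetical names introduced here, and the single rule (a determiner takes the gender of the following noun in the target language) is an assumption chosen because some Catalan and Spanish nouns differ in gender.

```python
# Toy bilingual dictionary (hypothetical entries, for illustration only):
# Catalan word -> (Spanish lemma, part of speech, Spanish gender)
LEXICON = {
    "dent": ("diente", "noun", "m"),   # feminine in Catalan, masculine in Spanish
    "senyal": ("señal", "noun", "f"),  # masculine in Catalan, feminine in Spanish
    "casa": ("casa", "noun", "f"),
    "el": ("el", "det", None),
    "la": ("el", "det", None),
}

# Generation table: Spanish surface form of the definite determiner by gender.
DET_FORMS = {("el", "m"): "el", ("el", "f"): "la"}


def translate(sentence):
    """Translate a Catalan phrase into Spanish with a three-step toy pipeline."""
    # 1) Analysis and lexical transfer: look up each word in the dictionary.
    #    Unknown words are passed through unchanged.
    units = []
    for word in sentence.lower().split():
        lemma, pos, gender = LEXICON.get(word, (word, "unk", None))
        units.append({"lemma": lemma, "pos": pos, "gender": gender})

    # 2) Structural transfer rule: a determiner agrees in gender with the
    #    noun that immediately follows it in the target language.
    for current, following in zip(units, units[1:]):
        if current["pos"] == "det" and following["pos"] == "noun":
            current["gender"] = following["gender"]

    # 3) Generation: produce the inflected target surface forms.
    output = []
    for unit in units:
        if unit["pos"] == "det":
            key = (unit["lemma"], unit["gender"])
            output.append(DET_FORMS.get(key, unit["lemma"]))
        else:
            output.append(unit["lemma"])
    return " ".join(output)


print(translate("la dent"))
print(translate("el senyal"))
```

Real systems rely on much larger dictionaries and rule sets; the sketch only shows how dictionary lookup and a hand-written rule interact.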

Phrase-based Statistical MT
Standard phrase-based statistical MT (Koehn et al., 2003) focuses on finding the most probable target text given the source text by means of probabilistic techniques. Given a sentence-aligned parallel corpus, statistical co-occurrences are studied to extract a bilingual dictionary of sequences of words (phrases), which are ranked using several features (e.g. conditional and posterior probabilities). In addition to this bilingual dictionary, which is considered the translation model, other models such as reordering or language models are trained. Note that the language model is trained on a monolingual corpus and gives information about the fluency of a sentence in the target language. All models are combined in the decoder, which uses beam search to extract the most probable target output given a source input. Note that the system is optimized in several steps, since the word alignment is determined before building the translation model. See Figure 3 for a schematic representation of this approach.
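As an illustration of how phrase pairs can be scored with frequency counts, the sketch below estimates translation probabilities in both directions by relative frequency. It assumes that the phrase pairs have already been extracted from a word-aligned corpus; the function name `phrase_probabilities` and the example pairs are hypothetical, and real systems add further features such as lexical weights.

```python
from collections import Counter


def phrase_probabilities(phrase_pairs):
    """Estimate phrase translation probabilities by relative frequency.

    phrase_pairs: list of (source_phrase, target_phrase) tuples, one per
    extracted occurrence. Returns a dict mapping each distinct pair to its
    probability in both translation directions.
    """
    pair_counts = Counter(phrase_pairs)
    source_counts = Counter(source for source, _ in phrase_pairs)
    target_counts = Counter(target for _, target in phrase_pairs)

    table = {}
    for (source, target), count in pair_counts.items():
        table[(source, target)] = {
            # count(source, target) / count(source)
            "p_target_given_source": count / source_counts[source],
            # count(source, target) / count(target)
            "p_source_given_target": count / target_counts[target],
        }
    return table


# Hypothetical extracted occurrences (Catalan -> Spanish).
pairs = [("s'ha de", "hay que")] * 3 + [("s'ha de", "se ha de")]
table = phrase_probabilities(pairs)
print(table[("s'ha de", "hay que")])
```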

Neural MT
Neural MT computes the conditional probability of the target sentence given the source sentence by means of an encoder-decoder (autoencoder) architecture (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014). First, the encoder reads the source sentence $(s_1, s_2, \dots, s_N)$ of $N$ words, computes the word embeddings $(e_1, e_2, \dots, e_N)$ and encodes them into an intermediate representation (also referred to as the context vector) by means of a recurrent neural network, which uses the gated recurrent unit (GRU) as its hidden unit. The GRU allows for better performance on long sentences. Then, the decoder, which is also a recurrent neural network, generates a corresponding translation $(t_1, t_2, \dots, t_M)$ of $M$ words based on this intermediate representation. Both encoder and decoder are jointly trained using the common statistical technique of Maximum (log-)likelihood Estimation (MLE).
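The description above can be written compactly. As a sketch under assumed notation (the context vector $c$, the parameters $\theta$ and the training set $\mathcal{D}$ of sentence pairs are symbols introduced here for illustration), the decoder factorizes the conditional probability over target words, and MLE training chooses the parameters that maximize the log-likelihood of the parallel data:

```latex
\[
p(t_1, \dots, t_M \mid s_1, \dots, s_N)
  = \prod_{j=1}^{M} p(t_j \mid t_1, \dots, t_{j-1}, c),
\qquad
\hat{\theta} = \arg\max_{\theta} \sum_{(s, t) \in \mathcal{D}} \log p_{\theta}(t \mid s).
\]
```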
This baseline autoencoder architecture is improved with an attention-based mechanism, in which the encoder uses a bidirectional recurrent neural network. The decoder then predicts each target word using the intermediate representation plus the context information given by the attention. See Figure 4.
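To make the role of attention concrete, the sketch below computes attention weights over the encoder states and the resulting context vector for one decoding step. It is a simplified illustration, not the system's actual model: it scores each encoder state with a plain dot product instead of a learned alignment network, and the names `softmax` and `attend` are hypothetical.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into weights that are positive and sum to 1."""
    highest = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - highest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def attend(decoder_state, encoder_states):
    """Return (weights, context) for one decoding step.

    decoder_state: list of floats, the current decoder hidden state.
    encoder_states: list of vectors, one per source position, each with
    the same dimension as decoder_state (an assumption of this sketch).
    """
    # Dot-product score between the decoder state and each encoder state.
    scores = [
        sum(d * h for d, h in zip(decoder_state, state))
        for state in encoder_states
    ]
    weights = softmax(scores)
    # The context vector is the weighted average of the encoder states.
    dimension = len(encoder_states[0])
    context = [
        sum(weight * state[i] for weight, state in zip(weights, encoder_states))
        for i in range(dimension)
    ]
    return weights, context


weights, context = attend([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(weights, context)
```

The source position whose encoder state best matches the decoder state receives the largest weight, so the context vector changes at every target word instead of being a single fixed summary of the sentence.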

Experimental Framework
This section reports details on the data used for training, optimizing and testing as well as a description of the parameters for each system in the comparison.

Data
We use a large corpus extracted from ten years of the paper edition of a bilingual Catalan newspaper, El Periódico (Costa-jussà et al., 2014). The Spanish-Catalan corpus is partially available via ELDA (Evaluations and Language Resources Distribution Agency) under catalog number ELRA-W0053. Development and test sets are extracted from the same corpus. Additionally, to test system performance out of domain, we use a test corpus from the medical domain. This medical corpus was kindly provided by the Universal-Doctor project. Preprocessing was limited to tokenization. Corpus statistics are shown in Table 1.

System details
Rule-based. We use the Apertium rule-based system (Forcada et al., 2011). Apertium is an open-source shallow-transfer MT system which was initially designed for translation between related language pairs. In particular, this rule-based system does not perform full syntactic parsing, in contrast to the general rule-based architecture described in section 4.1. The system is available from SourceForge, and we use its latest version, 1.2.1.
Phrase-based. We use Moses (Koehn et al., 2007), which is an open-source phrase-based MT system backed by a large community of developers. To build the system, we use standard/default parameters, which include: grow-diagonal-final-and word alignment symmetrization, lexicalized reordering, relative frequencies (conditional and posterior probabilities) with phrase discounting, lexical weights, phrase bonus, phrases up to length 10, a 5-gram language model with Kneser-Ney smoothing, word bonus and MERT (Minimum Error Rate Training) optimisation.
Neural-based. The neural MT system was built using open-source software available on GitHub. This code implements the autoencoder with attention presented in section 4.3. We use the parameters defined in Table 2. Regarding vocabulary limitation, we use a vocabulary size of 90,000 for both Spanish and Catalan. We replace out-of-vocabulary words (UNKs) using the standard methodology (Jean et al., 2015): we use the word-to-word translation model learned with 'fast-align' (Dyer et al., 2013) or, if no translation is available, the aligned source word is used. We use an embedding size of 512, a hidden dimension of 1024, a batch size of 32, no dropout, a learning rate of 0.001 and Adadelta optimization.
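The UNK replacement step can be sketched as follows. This is an illustrative reading of the procedure described above, not the code used in the experiments: the function `replace_unks` is hypothetical, the alignment between target and source positions is assumed to be given (for example, taken from the attention weights), and the word-to-word dictionary is represented as a plain Python dict.

```python
def replace_unks(target_tokens, source_tokens, alignment, lexicon, unk="UNK"):
    """Replace each UNK in the output with a translation of its source word.

    alignment[j] is the index of the source token aligned to target
    position j (assumed to be available). lexicon maps a source word to
    its most likely translation. If the source word has no dictionary
    entry, the source word itself is copied to the output.
    """
    output = []
    for position, token in enumerate(target_tokens):
        if token != unk:
            output.append(token)
            continue
        source_word = source_tokens[alignment[position]]
        # Dictionary translation if available, otherwise copy the source word.
        output.append(lexicon.get(source_word, source_word))
    return output


source = ["el", "metge", "arriba", "a", "Girona"]   # Catalan input
hypothesis = ["el", "UNK", "llega", "a", "UNK"]      # Spanish output with UNKs
alignment = [0, 1, 2, 3, 4]                          # monotone, for simplicity
lexicon = {"metge": "médico"}                        # toy word-to-word model
print(replace_unks(hypothesis, source, alignment, lexicon))
```

Copying the source word is a reasonable fallback for this language pair, since proper nouns and many cognates are written identically in Catalan and Spanish.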

Results
This section evaluates the three systems in terms of standard automatic measures. Then, we show some examples of translation outputs and perform a manual comparison. Table 3 shows results in terms of METEOR (Lavie and Agarwal, 2007) and BLEU (Papineni et al., 2002). The best results for the in-domain test set are achieved by the neural MT system in both translation directions. The best results for the out-of-domain corpus vary depending on the translation direction and measure: for Catalan-to-Spanish, the best results are obtained with the phrase-based system; for Spanish-to-Catalan, the best results are obtained with the rule-based system in terms of BLEU, but with the phrase-based system in terms of METEOR. In all cases, differences are statistically significant (99%) according to paired bootstrap resampling (Koehn, 2004).
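The significance test mentioned above can be sketched as follows. This is a simplified illustration, not the evaluation script used for Table 3: it assumes that a per-sentence quality score is available for each system and compares the summed scores on each resampled test set, whereas a BLEU-based test would recompute the corpus-level score on each resample. The function name `paired_bootstrap` and its parameters are hypothetical.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_samples=1000, seed=0):
    """Fraction of resampled test sets on which system A beats system B.

    scores_a[i] and scores_b[i] are the scores of the two systems on the
    same test sentence i. Each resample draws, with replacement, as many
    sentences as the original test set contains.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same sentences")
    rng = random.Random(seed)
    size = len(scores_a)
    wins = 0
    for _ in range(n_samples):
        # The same resampled sentence indices are used for both systems.
        indices = [rng.randrange(size) for _ in range(size)]
        total_a = sum(scores_a[i] for i in indices)
        total_b = sum(scores_b[i] for i in indices)
        if total_a > total_b:
            wins += 1
    return wins / n_samples


# Hypothetical per-sentence scores for two systems.
system_a = [0.9, 0.8, 0.7, 0.6]
system_b = [0.1, 0.2, 0.3, 0.4]
print(paired_bootstrap(system_a, system_b))
```

Under this scheme, system A would be reported as significantly better at the 99% level if it wins on at least 99% of the resampled test sets.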

Automatic measures
To summarise, neural MT is significantly better in in-domain translation, but it is left behind out of domain. In this out-of-domain task, the rule-based system becomes competitive with the corpus-based approaches.
As expected, a simple naive system combination like MBR provides the best final translation results. This means that the systems can complement each other, especially on the out-of-domain test set.
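The idea behind MBR combination is to select, among the outputs of the different systems, the hypothesis that agrees most with the others. The sketch below is a minimal illustration under stated assumptions, not the configuration used in the experiments: it gives every hypothesis the same weight and uses a simple word-overlap F1 score as the similarity function in place of a sentence-level BLEU. The names `overlap_f1` and `mbr_select` are hypothetical.

```python
from collections import Counter


def overlap_f1(hypothesis, reference):
    """Word-overlap F1 between two sentences (a stand-in for BLEU)."""
    hyp_counts = Counter(hypothesis.split())
    ref_counts = Counter(reference.split())
    matches = sum((hyp_counts & ref_counts).values())
    if matches == 0:
        return 0.0
    precision = matches / sum(hyp_counts.values())
    recall = matches / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


def mbr_select(hypotheses):
    """Return the hypothesis with the highest total similarity to the others.

    Maximizing the expected gain is equivalent to minimizing the expected
    loss (risk) when all hypotheses are equally weighted.
    """
    def expected_gain(index):
        return sum(
            overlap_f1(hypotheses[index], other)
            for j, other in enumerate(hypotheses)
            if j != index
        )

    best_index = max(range(len(hypotheses)), key=expected_gain)
    return hypotheses[best_index]


# Hypothetical outputs of three systems for the same source sentence.
outputs = [
    "hay que abrir la puerta",
    "hay que abrir la puerta",
    "se ha de abrir puerta",
]
print(mbr_select(outputs))
```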

Manual analysis
The manual analysis in this section is intended to complement the information provided by the automatic measures in the previous section. Table 4 shows several translation examples from the three systems for the in-domain test set. The examples show the advantages of the neural MT system compared to the rule-based and/or phrase-based systems. Consistent with the previous automatic results, neural MT shows the best results. Each example in Table 4 shows how neural MT is able to improve translation in the following terms:
1. Better gender agreement (compared to phrase-based MT), which clearly affects the fluency of the final translation.
2. No missing content words (compared to phrase-based MT) and use of the right verb tense (compared to rule-based MT), which has an impact on the adequacy of the translation.
3. Avoiding redundant words like "botar", which produces a better translation since this word would not sound fluent in this context in Catalan.
4. Choosing the right translation of a polysemous word, which improves adequacy and fluency at the same time. The verb "ser" in Catalan has two main translations in Spanish, "ser" or "estar"; in this case, the correct one is the latter.
5. Avoiding literal translation where possible. In particular, the obligation "s'ha de" in Catalan has to be translated as "hay que" or "han tenido que" in Spanish.
7. Adding words to make the translation more fluent, such as the use of "cuyas".
Finally, example 8 shows the main mistake that neural MT makes systematically for this language pair: missing initial determiners. Table 5 shows examples from the out-of-domain test set. In this case, example 1 shows how the neural MT system uses the pronoun correctly, although its output does not coincide with the reference. In example 2, neural MT uses the wrong translation of "pedir", one that would be correct in some contexts of the training material. Example 3 shows how an unnecessary (but also correct) word is added to the translation by the neural MT system. Finally, example 4 shows a missing translation of a word that is out-of-vocabulary.
Most neural MT errors could be addressed by using existing techniques. Missing determiners could be addressed with coverage-based neural MT (Tu et al., 2016); wrong translations may be reduced by using a language model (Gulcehre et al., 2017); and out-of-vocabulary words may be reduced by using existing approaches such as Byte Pair Encoding (BPE) (Sennrich et al., 2016) or character-based neural MT (Costa-jussà and Fonollosa, 2016). The integration of these advances for the Catalan-Spanish language pair is left for future work.
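To illustrate why BPE reduces the number of out-of-vocabulary words, the sketch below learns merge operations from a word-frequency list: it starts from characters and repeatedly merges the most frequent pair of adjacent symbols, so that rare words can later be represented as sequences of known subword units. This is only a minimal sketch of the general procedure, not part of the experiments reported here; the function name `learn_bpe`, the end-of-word marker `</w>` and the toy frequencies are assumptions.

```python
from collections import Counter


def learn_bpe(word_freqs, num_merges):
    """Learn up to num_merges BPE merge operations from word frequencies.

    word_freqs maps each word to its corpus frequency. Returns the list of
    merged symbol pairs in the order in which they were learned.
    """
    # Represent each word as a tuple of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count every pair of adjacent symbols, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)

        # Apply the merge to every word in the vocabulary.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged = []
            i = 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges


print(learn_bpe({"abab": 3, "ab": 2}, 2))
```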

Discussion and Further Work
This paper presents a comparison between rule-based, phrase-based and neural MT systems for the Catalan-Spanish language pair. The neural MT system performs best on the in-domain test set, whereas on the out-of-domain test set the best performance is obtained by the rule-based system (Spanish-to-Catalan, in BLEU) and by the phrase-based system (Catalan-to-Spanish).
Regarding our research question: Is neural MT competitive with current well-performing rule-based and phrase-based MT systems? Based on the automatic and manual analysis in this paper, the answer is yes, especially for in-domain sets. Therefore, it is worth using neural MT for Catalan-Spanish when building domain-specific translation systems, and it is worth using system combination for the out-of-domain case. Again, note that efficiency and computational cost are not compared in this study.
In this paper, we only implemented a baseline neural MT system. Further work would be to show how recent improvements in neural MT, such as those mentioned in the previous section, affect this language pair: Byte Pair Encoding (BPE) (Sennrich et al., 2016), character-based neural MT (Costa-jussà and Fonollosa, 2016), coverage (Tu et al., 2016), language models (Gulcehre et al., 2017), multilingual translation, and other strategies (Wu et al., 2016).