Ongoing Study for Enhancing Chinese-Spanish Translation with Morphology Strategies

Chinese and Spanish have different morphological structures, which poses a major challenge for translation between the two languages. In this paper, we analyze several strategies to better generalize from Chinese, a language with little morphology, to Spanish, a morphologically rich language. The strategies use a first step of Spanish morphology-based simplification and a second step of fullform generation. The latter can be done using a translation system or classification methods. Finally, both steps are combined either by concatenation in cascade or by integration in a factored-based style. Ongoing experiments (based on the United Nations corpus) and their results are described.


Introduction
The structure of Chinese and Spanish differs at most linguistic levels, e.g. morphology, syntax and semantics. In this paper, we focus on reducing the gap between both languages at the level of morphology. On the one hand, Chinese is an isolating language, which means it has a low morpheme-per-word ratio. On the other hand, Spanish is a fusional language, which means it tends to overlay many morphemes in a single word. The challenge when translating between Chinese and Spanish is bigger in the direction from Chinese to Spanish, given that the same Chinese word can generate multiple Spanish words. For example, the Chinese word fàn (in transcribed Pinyin) can be translated by comer, como, comí, comeré (to eat, I eat, I ate, I will eat), which correspond to several tense inflections of the same verb, and also by comes, comiste, comerás (you eat, you ate, you will eat), which correspond to several person inflections of the same verb. This poses a challenge in Statistical Machine Translation (SMT) because translations are learnt from the co-occurrence of words in both languages. When a word has multiple translations, it generates sparsity in the translation model.
In this study, we experiment with different strategies to add morphology knowledge to a standard phrase-based SMT system (Koehn et al., 2003) for the Chinese-to-Spanish translation direction. However, the presented techniques could be used for other pairs involving isolating and fusional languages. The rest of the paper is organized as follows. Section 2 gives a brief overview of related work, both on using morphology knowledge in SMT and on translating from Chinese to Spanish. Section 3 explains the theoretical framework of phrase-based SMT at a high level and the details of each strategy to introduce morphology into this system. Section 4 describes the experiments and the first results obtained for each strategy. Finally, Section 5 concludes this ongoing research and outlines future research directions.

Related Work
There are numerous studies that deal with morphology in the field of SMT. Without aiming at completeness, we cite works that:
• Preprocess the data to make the structure of both languages more similar by means of enriching (Avramidis and Koehn, 2008; Ueffing and Ney, 2003) or segmentation techniques in agglutinative (S. Virpioja et al., 2007) or fusional languages (Costa-jussà, 2015a).
• Modify models.
• Post-process the data (Toutanova et al., 2008; Bojar and Tamchyna, 2011; Formiga et al., 2013).
Research in this area is very active, e.g. PhD proposals using strategies based on deep learning (Gutierrez-Vasques, 2015).
Previous work on the Chinese-Spanish language pair focuses on compiling corpora and using pivot strategies (Costa-jussà et al., 2012) and on building a Rule-Based Machine Translation (RBMT) system (Costa-jussà and Centelles, In Press 2015). A high-level description of the state of the art in translation for this language pair is given in (Costa-jussà, 2015b).
Our work mixes several strategies, but it basically goes in the direction of (Formiga et al., 2013), which addresses the challenge of morphology as a post-processing classification problem. The idea is to translate from Chinese to a morphology-based simplified Spanish and then re-generate the morphology by means of classification algorithms. The competitive advantage of this strategy is the rise of algorithms based on deep learning techniques that can achieve high success rates, e.g. (Collobert et al., 2011).

Theoretical Framework
The phrase-based SMT system (Koehn et al., 2003) is trained on a sentence-aligned parallel corpus. It learns co-occurrences, and each token in the training set is treated as a distinct unit, regardless of whether it is morphologically related to other tokens. Therefore, in the extreme case where the word canto (I sing) is in the training set and the inflection of the same verb canté (I sang) is not, the latter is going to be considered an out-of-vocabulary word.
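The out-of-vocabulary effect can be illustrated with a minimal sketch. The lemma table below is hand-written and purely hypothetical (a real setup would rely on a morphological analyzer); it only shows that a vocabulary of surface forms misses canté, while a vocabulary of lemmas covers it.

```python
# Toy illustration only: LEMMAS is a hypothetical hand-written table,
# not the output of any real morphological analyzer.
LEMMAS = {
    "canto": "cantar",
    "canté": "cantar",
    "cantas": "cantar",
    "como": "comer",
    "comí": "comer",
}

# Tokens assumed to be seen on the target side of the training corpus.
train_tokens = ["canto", "cantas", "como"]

# Vocabulary of surface forms vs. vocabulary of lemmas.
surface_vocab = set(train_tokens)
lemma_vocab = {LEMMAS[token] for token in train_tokens}


def is_oov(unit, vocab):
    """Return True if the unit (word or lemma) was never seen in training."""
    return unit not in vocab


# "canté" was never seen as a surface form, but its lemma "cantar" was.
surface_oov = is_oov("canté", surface_vocab)
lemma_oov = is_oov(LEMMAS["canté"], lemma_vocab)
print(surface_oov, lemma_oov)
```

Working on lemmas removes this particular out-of-vocabulary case, at the cost of having to recover the inflection in a later step.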
Strategy 1. One well-known strategy to face this challenge is to add a part-of-speech (POS) language model which evaluates the probability of the POS-sequences instead of the word sequences.
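As a rough illustration of scoring POS sequences instead of word sequences, the following sketch trains an add-one smoothed bigram model over tag sequences. The short tag names, the toy training data and the function names are assumptions made for this example. It is a sketch under these assumptions, not a description of the language model used in the experiments.

```python
import math
from collections import Counter


def train_bigram_lm(tag_sequences):
    """Count tag bigrams and their histories over padded tag sequences."""
    history_counts, bigram_counts, tagset = Counter(), Counter(), set()
    for tags in tag_sequences:
        padded = ["<s>"] + list(tags) + ["</s>"]
        tagset.update(padded)
        for prev, cur in zip(padded, padded[1:]):
            history_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return history_counts, bigram_counts, len(tagset)


def log_prob(tags, model):
    """Add-one smoothed log-probability of a POS sequence."""
    history_counts, bigram_counts, vocab_size = model
    padded = ["<s>"] + list(tags) + ["</s>"]
    total = 0.0
    for prev, cur in zip(padded, padded[1:]):
        numerator = bigram_counts[(prev, cur)] + 1
        denominator = history_counts[prev] + vocab_size
        total += math.log(numerator / denominator)
    return total


# Toy training data: DA = determiner, NC = noun, VM = verb, AQ = adjective.
model = train_bigram_lm([["DA", "NC", "VM"], ["DA", "NC", "AQ"]])

# A determiner-noun order seen in training scores higher than the reverse.
seen_order = log_prob(["DA", "NC"], model)
unseen_order = log_prob(["NC", "DA"], model)
print(seen_order > unseen_order)
```

Because the tag inventory is far smaller than the word vocabulary, such a model can be estimated reliably even from a small corpus, which is the motivation for adding it as an extra feature.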
Strategy 2. This second strategy consists of building a cascade of systems: first, translate from source to a morphology-based simplified target; second, translate from this simplified target to the fullform target, as shown in Figure 3.
One straightforward morphological simplification is to adopt lemmas, as shown in Table 1.
Strategy 3. This third strategy is based on factored-based translation, which uses linguistic information of words, e.g. lemmas and POS. The idea is that the translation model based on words is used if the translation of the word is available, and if not, lemmas and POS are used in combination with a model to generate the final word. Figure 3 shows a typical representation of this factored strategy.
Strategy 4. This fourth strategy is based on previous work such as (Formiga et al., 2013), where the idea is to do a first translation from source to a morphology-based simplified target and then use a classifier to go from this simplified target to the fullform target. See the schema of this classification-based strategy in Figure 3.
The main challenges in the last strategy are:
1. Explore different simplifications of the target language in order to use the one with the best trade-off between the highest oracle and the lowest classification complexity.

Es lemmas: decidir examinar el cuestión en el período de sesión el tema titular " cuestión relativo a el derecho humano "
Es N lemmas: Decide examinar la cuestión en el período de sesión el tema titulado " cuestión relativas a los derecho humanos " .
Es V lemmas: decidir examinar la cuestión en el período de sesiones el tema titulado " Cuestiones relativas a los derechos humanos " .
Es D lemmas: Decide examinar el cuestión en el período de sesiones el tema titulado " Cuestiones relativas a el derechos humanos " .
Es P lemmas: Decide examinar la cuestión en el período de sesiones el tema titulado " Cuestiones relativas a los derechos humanos " .
Es A lemmas: Decide examinar la cuestión en el período de sesiones el tema titular " Cuestiones relativo a los derechos humano " .
Es tags: VMIP3S0 VMN0000 DA0MS0 NCFS000 SPS00 DA0MS0 NCMS000 SPS00 NCFP000 DA0MS0 NCMS000 AQ0MS0 Fp NCFP000 AQ0FP0 SPS00 DA0MS0 NCMP000 AQ0MP0
Es: Decide examinar la cuestión en el período de sesiones el tema titulado " Cuestiones relativas a los derechos humanos " .

In this paper, we study the first challenge of exploring different simplifications. However, we do not address the classification challenge, which is left to further work. It would be interesting to use deep learning techniques, which are leading to large improvements in natural language processing (Collobert et al., 2011).

Ongoing Experiments
In this section we show experiments and results with the four strategies proposed in the previous section.
As discussed in the literature, there are not many parallel corpora available for Chinese-Spanish (Costa-jussà et al., 2012). In this work, we use the data set from the United Nations (Rafalovitch and Dale, 2009). The training corpus contains about 60,000 sentences (around 2 million words), and the development and test corpora contain 1,000 sentences each. The baseline system is a standard phrase-based SMT system trained with Moses, with the default parameters.
Table 2 shows results for strategies 1, 2 and 3 in terms of BLEU (Papineni et al., 2002). From the BLEU scores, we see that strategy 1 gives slight improvements, but strategies 2 and 3 do not.
Table 2: BLEU scores for Zh2Es translation task and different morphology strategies.
Table 3 shows several oracles for strategy 4 with different morphology-based simplifications of Spanish. The best oracles are for lemmas. Then, we explore other simplifications, including lemmatizing only nouns (N), verbs (V), determiners (D), possessives (P) or adjectives (A). None of these alternatives approaches the best oracle obtained by lemmatizing all words.
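The full and category-restricted lemmatizations can be sketched as follows. The (word, lemma, tag) triples are hand-annotated for illustration, and matching a category by the first letter of the tag is an assumption of this sketch, not necessarily how the simplifications were produced in the experiments.

```python
def simplify(tokens, categories=None):
    """Replace words by lemmas.

    tokens: list of (word, lemma, tag) triples.
    categories: set of tag-initial letters to lemmatize (e.g. {"N"});
                None means lemmatize every token.
    """
    simplified = []
    for word, lemma, tag in tokens:
        if categories is None or tag[:1] in categories:
            simplified.append(lemma)
        else:
            simplified.append(word)
    return simplified


# Hand-annotated fragment "los derechos humanos" (illustrative tags).
fragment = [
    ("los", "el", "DA0MP0"),
    ("derechos", "derecho", "NCMP000"),
    ("humanos", "humano", "AQ0MP0"),
]

all_lemmas = simplify(fragment)          # lemmatize everything
noun_lemmas = simplify(fragment, {"N"})  # lemmatize nouns only
det_lemmas = simplify(fragment, {"D"})   # lemmatize determiners only
adj_lemmas = simplify(fragment, {"A"})   # lemmatize adjectives only
print(all_lemmas, noun_lemmas, det_lemmas, adj_lemmas)
```

The more categories are lemmatized, the less sparse the simplified target becomes, but the more information the later generation step has to recover.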
However, the interesting results are obtained when simplifying by number (num) and/or gender (gen). In both cases, we use the information of lemmas and tags. When generalizing number, instead of marking singular (S) or plural (P) in the POS tag, we use the generic N, so the number information is removed from the tag. We proceed similarly when generalizing gender or both number and gender (numgen).
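The tag generalization can be sketched as below. The positions of gender and number inside each tag are an assumption based on the EAGLES-style tags shown in the example sentence, and the generic gender symbol G is hypothetical, since only the generic N for number is specified in the text.

```python
# Assumed 0-based positions of gender and number in EAGLES-style tags,
# keyed by the first letter of the tag (the word category).
POSITIONS = {
    "N": {"gen": 2, "num": 3},  # e.g. NCFS000
    "A": {"gen": 3, "num": 4},  # e.g. AQ0FP0
    "D": {"gen": 3, "num": 4},  # e.g. DA0MS0
    "V": {"gen": 6, "num": 5},  # e.g. VMIP3S0
}

# "N" for number follows the text; "G" for gender is a hypothetical placeholder.
GENERIC = {"num": "N", "gen": "G"}


def generalize_tag(tag, features):
    """Replace the number and/or gender slot of a tag by a generic symbol.

    Slots holding "0" (feature not applicable) and tags of categories
    without an entry in POSITIONS are left unchanged.
    """
    slots = POSITIONS.get(tag[:1], {})
    chars = list(tag)
    for feature in features:
        index = slots.get(feature)
        if index is not None and index < len(chars) and chars[index] != "0":
            chars[index] = GENERIC[feature]
    return "".join(chars)


def simplify_token(lemma, tag, features):
    """Return the simplified unit as a (lemma, generalized tag) pair."""
    return lemma, generalize_tag(tag, features)


num_only = generalize_tag("NCFS000", ["num"])
gen_only = generalize_tag("NCFS000", ["gen"])
numgen = simplify_token("relativo", "AQ0FP0", ["num", "gen"])
print(num_only, gen_only, numgen)
```

With this representation, the later classification step only has to predict the generalized slots (number, gender or both) rather than the complete inflection.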
Oracles get closer to the lemmas simplification when simplifying only number and gender in Spanish. This finding is relevant because it simplifies the classification task that we are considering as further work.
Table 3. Note that simplifications in number and gender use lemmas plus POS tags to omit just the corresponding information that will need to be recovered in the classification stage.

Conclusions and Further Work
This paper presents ongoing work on enhancing a standard phrase-based SMT system by dealing with morphology. We have reported several strategies, including adding POS language modeling, experimenting with cascade systems and using factored-based translation models. Only the first one yielded improvements over the baseline. An additional strategy consists of studying different Spanish simplifications and then generating the fullform with classification techniques. Experiments show that simplifying only gender and number achieves oracles almost as good as simplifying to lemmas. This is an interesting result that reduces the level of complexity of the classification task. As further work, we will use classification techniques based on deep learning.