SMT error analysis and mapping to syntactic, semantic and structural fixes

This paper argues in favor of a linguistically-informed error classification for SMT to identify system weaknesses and map them to possible syntactic, semantic and structural fixes. We propose a scheme which includes both linguistic-oriented error categories and SMT-oriented edit errors, and evaluate an English-Spanish system and an English-Basque system developed for a Q&A scenario in the IT domain. The classification, in our use-scenario, reveals great potential for fixes from lexical semantics techniques involving entity handling for IT-related names and user interface strings, word sense disambiguation for terminology, as well as argument structure for prepositions and syntactic parsing for various levels of reordering.


Introduction
Once we build a baseline SMT system, we run an evaluation to check its performance and guide improvement. Given the nature of statistical systems and their learning process, linguistic-oriented error analysis has been considered unfit for their evaluation. Even when it is identified that a particular linguistic feature is incorrectly handled, it is not clear how to specifically address it during training if we resort to common generic, non-deterministic techniques. However, when syntax, semantics and structure (SSS) come into play, error analysis regains relevance, as it can pinpoint specific aspects that can be addressed through the more targeted techniques they have brought to MT development.
Based on two baseline SMT systems, one for the English-Spanish pair and one for English-Basque, we present a methodology and classification for error analysis, a description of the results and a mapping to possible fixes using SSS techniques.

Error classification schemes
Different classification schemes have been proposed in recent years to categorize machine translation errors. Starting in the 90s, the LISA QA model was adopted by a good part of the industry. This model included a list of "objective" error types, graded by their severity and pre-assigned penalty points. The SAE J2450 standard, from the automotive service industry, also became popular. What became clear from these first efforts was that no one-size-fits-all evaluation scheme is possible for MT. Each player within the translation workflow, from developers to vendors and clients, has its own needs and expects different information from the evaluations.
After LISA ceased operations, two major efforts emerged: TAUS presented its Dynamic Quality Framework (DQF) and the QTLaunchPad project developed the Multidimensional Quality Metrics (MQM). The DQF tackles quality evaluation by identifying the objective of each evaluation and by offering a bundle of tools to satisfy each need. Specifically, it offers productivity testing based on post-editing effort, adequacy and fluency tests, translation comparisons and error classification. The error scheme, proposed after a thorough examination of industry practices, covers four main areas, namely, Accuracy, Language, Terminology and Style, with limited subcategories. With a strong industrial view, it focuses on establishing return on investment and on benchmarking performance to allow for informed decisions, rather than providing a detailed development-oriented error analysis.
The MQM is a framework that can be used to define metrics in order to assign a level of quality to a text. Each evaluation must identify the relevant categories for its goals and customize the metric. MQM Core is a hierarchy of 22 issues, at different levels of granularity. If we consider Accuracy and Fluency, the two top-level categories that best focus on intratextual diagnosis, subcategories branch out and get more detailed, although they remain at a relatively general level. The authors claim that considerably more detailed subclasses might be necessary to diagnose MT problems, and the framework allows for user-defined extensions, even if this is not encouraged.
The MQM puts together three different dimensions of error classification. The two top-level categories, Accuracy and Fluency, can be seen as the effect the errors have on a translated text. The lower levels mix concepts from two further dimensions. Some of the subcategories refer to actual errors systems make, such as mistranslation or grammar, whereas others refer to the way in which these errors are rendered, namely, omission, addition and incorrect. When trying out the scheme to perform our evaluation, we saw that the distinction between fluency and accuracy might, to some extent, be useful when prioritizing fixes. However, we found it difficult to assign an error to a specific subclass, as overlaps between dimensions occurred constantly. For example, grammar is placed under fluency but we could argue that an incorrect tense might lead to a significant change in meaning, and therefore result in an accuracy issue. Similarly, one could claim that the rendering possibilities apply to almost all, if not all, types of errors, rather than forming a category of their own. For example, Addition is a direct subclass of Accuracy, even though extra function words can also appear in a translation. Also, we strongly felt that some subclasses were too broad to be meaningful when deciding on a targeted SSS solution.
Among schemes that have emerged from research groups, Vilar et al. (2006) presented one of the first to focus on identifying errors made by statistical systems. Probably motivated by the fact that these systems are not controlled by linguistic rules and are not deterministic in this respect, the top-level categories proposed were Missing words, Word order, Incorrect words, Unknown words and Punctuation, that is, types of edits unrelated to linguistic reasoning. The lower categories are slightly more linguistic but they remain on SMT parameters such as local/long range, stems and forms. While Word order and Unknown words point to specific efforts for improvement, the Incorrect words category is broad and requires, as the authors suggest, further customization depending on the language pair at hand. Again, this classification lacks the linguistic detail we aimed to collect for linguistically-oriented fixes.

Classification schemes: our approach
Given our goal and the nature of our systems, we opted for a general linguistic classification with an additional dimension to cover the edit type of each error: missing, additional or incorrect (Figure 1). Once a linguistic error is identified, it is classified based on the edit-type dimension. We established six top-level linguistic categories, which are further detailed in subclasses. These subclasses are not static; they can be omitted or extended during evaluation to suit the errors found in texts. The linguistic depth and the clear division between dimensions overcome the lack of detail of the DQF model and the overlaps that emerged in the MQM model, while incorporating the SMT-oriented edits proposed by Vilar et al. (2006).
We worked with a two-to-four-level scheme to gather as much detail as possible about the errors found. We describe the six main categories below.

Lexis
This category includes incorrect choices for general vocabulary and terminology, as well as longer set phrases, idioms or expressions.

Morphosyntax
This category includes morphological and syntactic errors. We fused both categories as these types of errors are often so intertwined that it is difficult to opt for one category over the other. Moreover, the classification is proposed as a tool to easily summarize and assimilate system error information, and the exact top-level classification of the items should not have an impact on research decisions. Such decisions should be guided by the fixing requirements and possibilities of the errors.

Verbs
A separate category was defined for verb phrases because of their complexity. Whereas English verb phrases carry lexical, aspectual, tense, modality and voice information, Spanish verb phrases also have subject information, and in the case of Basque, information about objects is also included. The high variability of conjugated verbs and auxiliaries poses great difficulty for statistical systems. We divided this category into subgroups based on the information mentioned above.

Order
Again, this is a dedicated category due to the impact order has on the overall comprehensibility of the translations and because it is a property that can be addressed specifically in statistical systems. We distinguished several levels: sentence, clause and phrase. We also identify whether the issues involve orderings of units of the same level or unit-specific issues, which can be internal orderings or splits.

Punctuation
This category includes punctuation and orthographic issues such as punctuation marks, capitalization and orthotactic constraints (orthographic rules governing lemma-affix gluing).

Untranslated
We added a category for source words that are left in the original language.
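
The two-dimensional scheme described in this section can be sketched as a small data structure. This is only an illustrative sketch, not the annotation tool used for the evaluation: the class names, the `subclasses` field and the sample annotations are assumptions introduced here, with the subclass labels chosen loosely after the error types discussed in the text.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    """The six top-level linguistic categories of the scheme."""
    LEXIS = "lexis"
    MORPHOSYNTAX = "morphosyntax"
    VERBS = "verbs"
    ORDER = "order"
    PUNCTUATION = "punctuation"
    UNTRANSLATED = "untranslated"


class EditType(Enum):
    """The second dimension: how the error is rendered."""
    MISSING = "missing"
    ADDITIONAL = "additional"
    INCORRECT = "incorrect"


@dataclass(frozen=True)
class ErrorAnnotation:
    category: Category
    edit_type: EditType
    # Open-ended path of subclasses (hypothetical labels), so that
    # subclasses can be omitted or extended during evaluation.
    subclasses: tuple = ()


def category_shares(annotations):
    """Return the percentage of errors per top-level category."""
    counts = Counter(a.category for a in annotations)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cat: 100.0 * n / total for cat, n in counts.items()}


# Toy annotations, invented for illustration only.
annotations = [
    ErrorAnnotation(Category.PUNCTUATION, EditType.MISSING, ("question mark",)),
    ErrorAnnotation(Category.ORDER, EditType.INCORRECT, ("phrase", "internal")),
    ErrorAnnotation(Category.MORPHOSYNTAX, EditType.MISSING, ("preposition",)),
    ErrorAnnotation(Category.MORPHOSYNTAX, EditType.INCORRECT, ("preposition",)),
]
shares = category_shares(annotations)
```

Keeping the edit type as a separate field, rather than as a subclass of a category, mirrors the division between dimensions that the scheme relies on to avoid overlaps.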

The systems

English-Spanish
The English-Spanish system is a standard phrase-based system built on Moses. It uses basic tokenization and a pattern excluding URLs, truecasing and language model interpolation. It has been trained on bilingual corpora including Europarl, United Nations, News Commentary and Common Crawl (∼355 million words). The monolingual corpora used to learn the language model include the Spanish texts of Europarl, News Commentary and News Crawl (∼60 million words). For tuning, a set of 1,000 in-domain interactions (question-answer pairs) was made available. The original interactions are in English and they were translated into Spanish by human translators.
The system was evaluated on a test-set similar to that used for tuning: a second batch of 1,000 in-domain interactions. The English-Spanish system obtains a BLEU score of 45.86.

English-Basque
The English-Basque system is also a standard phrase-based system built on Moses. It uses basic tokenization, lemmatization and lowercasing. Stanford CoreNLP (Manning et al., 2014) is used for English analysis and Eustagger (Alegría et al., 2002) for Basque. It uses a 5-gram language model. To better address the agglutinative nature of Basque, the word alignments were obtained over the lemmas, and were then projected to the original word forms to complete the training process.
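
The projection of lemma-level alignments onto word forms can be sketched as follows. This is a minimal illustration under the assumption that lemmatization keeps one lemma per token, so that token positions are shared between the lemma sequence and the surface sequence. The function names and the toy sentence pair are introduced here for illustration and are not taken from the system; the `i-j` alignment notation is the one commonly used for word alignment files in Moses training.

```python
def parse_alignment(line):
    """Parse alignment points written as 'i-j' pairs separated by spaces."""
    points = []
    for token in line.split():
        src, tgt = token.split("-")
        points.append((int(src), int(tgt)))
    return points


def project_alignment(lemma_alignment, src_forms, tgt_forms):
    """Map alignment points computed over lemmas back to surface word forms.

    Because each lemma stands in the same position as its word form,
    the alignment indices can be reused unchanged on the surface tokens.
    """
    pairs = []
    for i, j in lemma_alignment:
        if not (0 <= i < len(src_forms) and 0 <= j < len(tgt_forms)):
            raise ValueError(f"alignment point {i}-{j} is out of range")
        pairs.append((src_forms[i], tgt_forms[j]))
    return pairs


# Toy example (invented): the alignment was obtained over the lemmas
# "click on the icon" / "egin klik ikono" and is reused on the word forms.
src_forms = ["Click", "on", "the", "icon"]
tgt_forms = ["Egin", "klik", "ikonoan"]
alignment = parse_alignment("0-0 0-1 1-2 3-2")
projected = project_alignment(alignment, src_forms, tgt_forms)
```

In this toy pair the English preposition "on" ends up aligned with the inflected form "ikonoan", which is the kind of link that alignment over lemmas is meant to make easier to learn for an agglutinative target language.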
The system was trained on translation memory (TM) data containing academic books, software manuals and user interface strings (∼12 million words), and web-crawled data (∼1.5 million words) made available by Elhuyar. For the language model, the Basque text of the parallel data and the Basque text of Spanish-Basque TMs of administrative text made available by Elhuyar (∼7.4 million sentences) were used. Again, a set of 1,000 in-domain interactions was used for tuning after manually translating the original text into Basque.
The system was evaluated on a second test-set of 1,000 in-domain interactions, obtaining a BLEU score of 20.24.

Examples of errors by category (English source, Spanish system output):
ordering: Tap "Import" to copy your Android browser favorites. / Toca "Importar" para copiar su navegador de Android favoritos. (∼your favorites Android browser)
punctuation: If I buy a computer abroad, will it work in Portugal / Si compro un ordenador en el extranjero, funcionará en Portugal? (missing ¿)
untranslated: Then click on the yellow disc with a green tick. / Then haga clic en el disco de color amarillo con una marca verde.

Error analysis for the English-Spanish pair
Around half of the lexical errors (Table 1) emerge from the translation of user interface (UI) strings. Although it was not possible to identify whether the translations matched the final software version text exactly, in some cases the translations are clearly awkward. Problems are most relevant in multi-word strings, which are not translated as a unit, resulting in partial translations and inadequate capitalization. The translations of software and brand names display a similar behavior. These proper names tend to stay the same across languages, but the system does not always treat them this way. In addition, multiword names often get part of the name translated.
Issues with general vocabulary and terminology are also present (we consider as terminology those words that acquire a specialized meaning in our domain or that are specific to it). Whereas some inadequate translations do not have a clear origin, a good number of them clearly emerge from incorrect word sense disambiguation.
Morphosyntactic errors account for about 29% of the total errors. Although they are very widespread across the different subcategories, we find that prepositions, subordinate markers and POS errors are the most recurrent cases. (For a complete classification see appendices A and B.)
The Verbs category accounts for 18% of the errors. Although a number of verbs lack the correct agreement or use an inadequate tense or voice, the most recurrent error seems to come from the mood. This is typical of instructional texts, where orders, given with the infinitive form in English, can be translated as imperatives or infinitives. This is usually a stylistic decision but one that needs to be consistent across the documentation and, in particular, within the sentence or paragraph.
A number of order issues have been identified (11%), which mainly involve the composition of multiword noun phrases. We found 7 cases where a noun phrase was split and 7 cases where the elements were incorrectly ordered despite staying in close proximity.
Punctuation errors (6%) and untranslated words (5%) are infrequent. The former include cases of incorrect capitalization and use of question-initial marks. The latter involve function and content words.

Error analysis for the English-Basque pair
We again performed a random selection of 100 interventions. Based on overall counts, 6 out of 140 sentences were correct and the remaining 134 included 393 errors, at least 7 errors per intervention.
Lexical errors account for around 23% of the total (Table 2). Despite a number of errors due to incorrect word sense disambiguation, most errors emerge from UI strings and software/brand name translations. Capitalization errors in these units were included in this subcategory (36 cases).
Morphosyntactic errors account for over 39% of the total errors. Most, around 64%, concern the translation of prepositions and subordinate conjunctions. In Basque, prepositions are translated into postpositions that are attached to the last word of the phrase (the nucleus) and the same happens with subordinate markers, attached to the last word of the subordinate clause. It is worth noting the high number of missing elements in this subcategory, 90 cases recorded out of 149 (10 cases out of 49 for Spanish).
Verbs show a considerable number of errors (18%), especially if we take into account that 21 main verbs, which display the lexical meaning and the aspect, and 23 auxiliaries, which display tense, mood and paradigm, are missing. Out of the verb phrases that are constructed, the aspect, the paradigm and agreements generate errors.
Order errors account for 14% of the total errors. The sequencing of noun phrase elements stands out as the main source of errors, whether within the phrase or because splits occurred. The positioning of relative clauses with respect to their heads also emerged as a problematic area with 11 occurrences.

Fixing possibilities with syntax, semantics and structure
From the error analysis of the English-Spanish and English-Basque systems we see that errors emerge from two main sources, use-scenario-specific features and language pair-specific features.
The text-type and domain of the translations have an impact on the difficulties the system encounters. In the case we present, we work on a question-and-answer (Q&A) scenario in the information technology (IT) domain. The texts, therefore, mainly consist of instructions and descriptions, and include a high degree of terminology, brand and software names, as well as UI strings, all of which our systems have difficulty dealing with.
Lexical semantics, and in particular, (cross-lingual) named-entity recognition (NER) and translation techniques could greatly benefit our application scenario. Following the implementation of NER in MT by Li et al. (2013) and similar work, it would be possible to train a NER system to identify IT names. We could possibly create a separate category for the disambiguation process (NED) if we plan to treat them in a specific way. For example, we may decide that NEs classified as ITname should be left in English, or that they should be looked up in Wikipedia following techniques such as Mihalcea and Csomai's (2007) and Agirre et al.'s (2015) to find an equivalent entry in the target language, and as a result, its translation. Alternatively, we could opt for dynamic searches in multilingual websites of specific brands or the use of pre-compiled dictionaries from these resources.
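
The option of leaving IT names in English can be sketched with placeholder masking around the translation step. This is a simplified illustration: a fixed list of names stands in for a trained NER system, the placeholder format is arbitrary, and the Spanish sentence is a hand-written stand-in for system output. All names in the code are hypothetical.

```python
import re

# Hypothetical gazetteer standing in for the output of a trained NER system.
IT_NAMES = ["Android", "Google Chrome", "Chrome"]


def mask_entities(text, names):
    """Replace known IT names with placeholders before translation."""
    if not names:
        return text, {}
    # Longest names first, so that multiword names are matched as a unit.
    ordered = sorted(names, key=len, reverse=True)
    pattern = re.compile(
        "|".join(rf"(?<!\w){re.escape(name)}(?!\w)" for name in ordered)
    )
    mapping = {}

    def replace(match):
        key = f"__NE{len(mapping)}__"
        mapping[key] = match.group(0)
        return key

    return pattern.sub(replace, text), mapping


def unmask_entities(text, mapping):
    """Restore the original English names in the translated text."""
    for key, value in mapping.items():
        text = text.replace(key, value)
    return text


source = "Open Google Chrome on your Android phone."
masked, mapping = mask_entities(source, IT_NAMES)
# Hand-written stand-in for the MT output of the masked sentence.
translated = "Abre __NE0__ en tu teléfono __NE1__."
restored = unmask_entities(translated, mapping)
```

Masking the whole multiword name as one unit addresses the partial translations observed for software and brand names; looking names up in Wikipedia or in brand dictionaries would replace the restoration step rather than the masking step.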
The NER system could be expanded to include UIs. Cues to identify them could be anchors like icon, tab and dialog box, and phrases such as where it says, and > sequences. The systems had difficulty in identifying UIs and often provided translations that differ significantly from the strings we are used to seeing in software graphics. UIs usually have a fixed translation, often given by the product maker, and they must be treated as proper nouns in the sense that they are usually capitalized (first word only if multiword) and do not accept articles. We could choose to identify them and translate them using a specialized dictionary or even let the MT system output a candidate which considers the restrictions just mentioned.
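
The cues mentioned above can be illustrated with a few surface patterns. This is a rough heuristic sketch, not a trained recognizer: it assumes UI strings are capitalized word sequences in the English source, and the regular expressions and function name are introduced here for illustration only.

```python
import re

# Anchor words taken from the cues listed in the text.
ANCHORS = ("icon", "tab", "dialog box")
ANCHOR_ALT = "|".join(re.escape(anchor) for anchor in ANCHORS)

# Assumption: a UI string is a sequence of capitalized words, e.g. "Save As".
CAP = r"[A-Z][\w-]*(?:\s+[A-Z][\w-]*)*"

ANCHOR_RE = re.compile(rf"\b(?:the\s+)?({CAP})\s+(?:{ANCHOR_ALT})\b")
SAYS_RE = re.compile(rf'\bwhere it says\s+"?({CAP})"?')
PATH_RE = re.compile(rf"{CAP}(?:\s*>\s*{CAP})+")


def find_ui_candidates(sentence):
    """Return candidate UI strings found through the three cue types."""
    candidates = []
    # Menu paths such as "File > Save As": each step is a candidate.
    for match in PATH_RE.finditer(sentence):
        candidates.extend(part.strip() for part in match.group(0).split(">"))
    # Anchor words ("the Settings icon") and "where it says" phrases.
    for regex in (ANCHOR_RE, SAYS_RE):
        candidates.extend(match.group(1) for match in regex.finditer(sentence))
    # Remove duplicates while keeping the order of first appearance.
    return list(dict.fromkeys(candidates))


menu_example = find_ui_candidates(
    "Go to File > Save As and click the Settings icon."
)
says_example = find_ui_candidates('Tap where it says "Import".')
```

Candidates found this way could then be looked up in a specialized dictionary or passed to the MT system as protected units. A sentence-initial capitalized verb directly followed by a UI string would be absorbed into the candidate, which is one reason a trained recognizer is preferable to patterns of this kind.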
Sense disambiguation, whether for general words or terms, has also been identified as a category worth addressing. Word sense disambiguation techniques along the lines of Carpuat et al. (2013), for example, could help. They propose a technique to identify senses unknown to the system, most probably because they are domain-specific senses not covered by the training corpus. Once marked, we could divert them and translate them using a specialized resource.
Out of the language pair-specific errors, the most glaring are Basque postpositional renderings of English prepositions. Predicate-argument structures and semantic roles, as suggested by the work of Liu and Gildea (2010) and Kawahara and Kurohashi (2010), are a way to improve the incorrect renderings and to force missing postpositions. Resources such as the Basque Verb Index (BVI) (Estarrona et al., forthcoming), which includes Basque verb subcategorization based on PropBank and VerbNet, with syntactic renderings assigned to each argument and mappings to WordNet for cross-lingual information, can be a starting point in this task.
Order errors have shown three types of issues: (i) phrases or chunks ordered incorrectly; (ii) phrases split along the sentence; and (iii) phrasal elements kept local but with incorrect phrase-internal order. For the first case, semantics has proposed the use of argument structure to learn reordering patterns (Wu et al., 2011). For cases ii and iii, syntax would have to come into play. Firstly, we need to provide the MT system with phrase boundary information so that contiguous phrases are not mixed. Secondly, phrase-internal reordering patterns or restrictions need to apply. Yeniterzi and Oflazer (2010), for example, encode a variety of local and non-local syntactic structures of the source side as complex structural tags and include this information as additional factors during training. Also, working on POS, Popović and Ney (2006) propose source-side local reordering patterns for Spanish-English and, working at the syntactic parse level, Wang et al. (2007) propose reordering patterns to address systematic differences (Chinese-English). Xiong et al. (2010) go beyond syntax and propose translation zones as unit boundaries, improving constituent-based approaches.
We finally focus on the generation of verb phrases, particularly relevant for the English-Basque pair, where verbs tend to go missing, but also to remedy incorrect verbal features in both pairs. The sparsity due to the complexity and morphological variety of Spanish and, even more so, Basque verb phrases is most probably the main reason for their incorrect handling. This leads us to propose the generalization of features, such as lemmatization of verbs, while suggesting a parallel transfer of source verb features to final postprocessing, for instance. Verbal transfer has not received attention so far, except when integrated within argument structure techniques, such as the work of Xiong et al. (2012).

Conclusions
We proposed a dynamic, extensible linguistically-informed error classification for SMT which includes six top-level linguistic error categories with further subclasses, and a second dimension for SMT-oriented edits covering additions, omissions and incorrect words. This addresses the lack of linguistic detail and flexibility of metrics such as the DQF, and integrates the SMT-oriented errors proposed by Vilar et al. (2006), avoiding the overlaps found in MQM.
We evaluated an English-Spanish and an English-Basque system developed for a Q&A scenario in the IT domain. The classification revealed issues strongly related to the domain and more general language pair-specific errors. We identified terminology and UI strings as the main issue for the lexical category. The morphosyntactic category showed more diverse issues. The most striking was the weak handling of English prepositions, and in particular, the poor generation of the Basque postpositions that correspond to English prepositions and subordinate markers. The complexity of target-side verbs also took its toll on system performance, with incorrect features for Spanish and an alarming number of missing main verbs and auxiliaries for Basque. As expected, ordering errors occurred at all levels, internal and external. Punctuation and Untranslated showed a low number of errors.
The exercise served to show the potential relevance of syntax, semantics and structure for fixing language-specific SMT errors, and the suitability of lexical semantics for IT-domain terminology and UI strings.