Testsuite on Czech–English Grammatical Contrasts

We present a pilot study of machine translation of selected grammatical contrasts between Czech and English in the WMT18 News Translation Task. For each phenomenon, we run a dedicated test which checks whether the candidate translation expresses the phenomenon as expected. The proposed type of analysis is not an evaluation in the strict sense, because the phenomenon can be correctly translated in various ways and we anticipate only one. What is nevertheless interesting are the differences between various MT systems and the single reference translation in their general tendency in handling the given phenomenon.


Introduction
English and Czech are typologically different languages. It goes without saying that some structural phenomena of either lack a direct structural equivalent in the other; for instance, Czech has not grammaticalized noun definiteness, while it boasts a complex system of verb aspect, which is absent in English. Such 1:n correspondences can pose translation problems in human as well as in machine translation. Intuitively, a translation system that has mastered these 1:n phenomena ought to be more successful than one that has not. Therefore we investigate whether there is a positive correlation between mastering some of these problematic phenomena and the performance of an En-Cs MT system.

Selected Linguistic Phenomena
Based on our experience as Czech learners of English, translators and developers/evaluators of MT systems, we have selected the following phenomena for EN-CS translation evaluation: English gerundial clause and English verb control with controlled infinitive.
The data comes from a manually parsed, word-aligned parallel treebank of English news texts and their human Czech translations (see Section 3).

English gerundial clause (and other ing-forms)
Modern Czech has no counterpart of the English gerund. Older Czech (i.e. until approximately 1950) used to have a verb form called the present transgressive, which would be very handy for translating many cases of English gerundial clauses, but this form is perceived as archaic and is hardly ever used. Modern Czech has the following options to render the English gerund: 1. finite clause with a choice of subordinators or conjunctions; 2. non-finite clause (infinitive clause, nominalization, or adjective/present participle).
In this study we tested whether the Czech equivalent in the reference vs. automatic translation was a finite clause or anything else.

Czech finite clause as equivalent to English gerundial clause
Czech is more sensitive to convoluted expressions than English. Therefore, non-finite clauses are usually most smoothly translated with finite clauses. To keep the Czech text coherent, though, human translators usually link the gerundial clause to the main clause with an explicit discourse connective, either a conjunction or a subordinator, based on their knowledge of context and their world knowledge. This may pose a challenge for MT systems. The most typical discourse connectives used to translate gerundial clauses would be -li (a clitic meaning if or whether), což (which, referring to a predicate), protože (because), když (when), že (that, as subordinator), jak (approximately as, expressing an event parallel to the main-clause event), and a (and). Example:
(1) When they arrived at the door, all were afraid to go in, fearing that they would be out of place.
Ale když přišli ke dveřím, všichni se báli vstoupit, protože se báli, že budou působit trapně.
(But when they arrived at the door, all were afraid to go in, because they feared that they would be out of place.)
(2) He said he was surprised by the EC's reaction, calling it "vehement, even frenetic."
Řekl, že byl překvapen reakcí ES, a nazval ji "prudkou, ba i bouřlivou".
(He said he was surprised by the EC's reaction, and he called it "vehement, even frenetic".)

Czech infinitive as equivalent to English gerundial clause
In our sample, an infinitive clause translates gerundial clauses in the subject position and gerundial clauses controlled by some verbs. Example:
(3) Avoiding failure is easy.
Vyhnout se neúspěchu je snadné.
(To avoid failure is easy.)
(4) So far no one has suggested putting the comptroller back on the board.
Zatím nikdo nenavrhl znovu dosadit do Rady také kontrolora.
(So far no one has suggested to put the comptroller back on the board.)

Nominalizations as equivalents to gerundial clause
The choice between a deverbal noun and an event noun is lexically motivated. A deverbal noun is a noun derived from a verb stem by the suffixes -ní, -tí; e.g. stát v. – stání n., proklít v. – prokletí n. This is an almost universal derivational mechanism, but it is stylistically associated with officialese and easily overused. An event noun is a noun with either no derivational relation to any semantically close verb stem (e.g. restaurace) or a less productive derivational relation to a verb stem (e.g. podpořit v. – podpora n., letět v. – let n.). These nominalizations, too, should be used sparingly to preserve readability. Example:

Present participle as equivalent to gerundial clause
The Czech present participle is derived from a verb but behaves like a regular adjective, including inflection; e.g. spát v. – spící adj.
As an equivalent to the English gerundial clause it requires a syntactic transformation of the source clause, approximately as though the original clause contained a participial clause instead of the gerund. Square brackets in the following example show the syntactic dependencies in English imagined by the translator and the corresponding structure in Czech. The main predicate is typeset in bold. Example:

English infinitive clause
The English infinitive clause has many functions; e.g. verb control or a convoluted subordinate clause. The infinitive as a controlled verb in verb control is present in both languages, but the many other uses of the English infinitive clause have different structural equivalents in Czech, mostly different types of finite subordinate clauses. A correct parse would possibly make it easier for an MT system to select a plausible Czech equivalent structure, but the parser was not able to reliably identify the correct syntactic governing node of an infinitive clause in our data sample.
Since we could not rely on the parser to distinguish an infinitive clause acting as an argument from one acting as an adjunct, we did not limit our search to arguments. Our sample contains the following Czech structural equivalents to English infinitive clauses: 1. infinitive or noun phrase; 2. finite clause.

Infinitive as controlled verb
A proportion of verb control cases have a 1:1 translation to Czech. Example:
(10) Comair said it paid cash but declined to disclose the price.
Společnost Comair uvedla, že zaplatila hotově, avšak odmítla uvést cenu.
However, many English controlling verbs have a Czech equivalent verb that cannot act as a controlling verb. To avoid a verbose paraphrase with an expletive pronoun and a subordinate content clause, Czech can resort to a nominalization (deverbal noun or event noun; see Section 2.1.3). Example:
English has an infinitive structure that resembles a consecutive clause but involves a semantic shift towards a temporal sequence of two events. This structure exists in Czech, too, but it is not common. A more natural translation would use a coordination of finite clauses. Example:
et al., 1994) and its translation into Czech. It has two syntactic layers of rooted dependency trees with labeled edges: the analytical (a-) layer with surface syntax and the tectogrammatical (t-) layer with deep syntax. In the a-layer, each word token is represented by one node. The inner structure of each node contains the word form, lemma, POS tag, dependency label (afun), and a reference to the governing node. The t-layer represents the linguistic meaning of each sentence by a tree that somewhat abstracts from details of morphology and surface syntax, but remains, by and large, a syntactic dependency tree. Each node contains references to the corresponding a-layer node(s), along with a whole range of other attribute values. Different reference types are used for content and auxiliary words, respectively. Apart from that, the t-layer provides semantic role labeling (functors), as well as coreference and ellipsis resolution. Figure 1 illustrates the data structure of PCEDT 2.0, including the alignment links pointing from English to Czech.
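The two-layer node structure described above can be pictured with a minimal Python sketch. The class and field names (ANode, TNode, lex_anode, aux_anodes, surface_forms) are hypothetical and are not the attribute names of the actual PCEDT data format. The tag, afun and functor values are simplified illustrations written by hand, not taken from the treebank.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ANode:
    """Surface-syntax (a-layer) node: one node per word token."""
    form: str               # word form as it appears in the sentence
    lemma: str
    tag: str                # POS tag (simplified here)
    afun: str               # dependency label
    parent: Optional[int]   # index of the governing a-node, None for the root


@dataclass
class TNode:
    """Deep-syntax (t-layer) node (hypothetical, heavily reduced)."""
    t_lemma: str
    functor: str                       # semantic role label
    parent: Optional[int]              # index of the governing t-node
    lex_anode: Optional[int] = None    # reference to the content word
    aux_anodes: List[int] = field(default_factory=list)  # references to auxiliary words


def surface_forms(tnode: TNode, anodes: List[ANode]) -> List[str]:
    """Return the word forms a t-node refers to, in surface order."""
    idxs = list(tnode.aux_anodes)
    if tnode.lex_anode is not None:
        idxs.append(tnode.lex_anode)
    return [anodes[i].form for i in sorted(idxs)]


# Fragment "že zaplatila" from example (10); all values are illustrative.
anodes = [
    ANode(form="že", lemma="že", tag="J", afun="AuxC", parent=None),
    ANode(form="zaplatila", lemma="zaplatit", tag="V", afun="Obj", parent=0),
]
# The subordinator has no t-node of its own; it is attached as an auxiliary word.
t_zaplatit = TNode(t_lemma="zaplatit", functor="EFF", parent=None,
                   lex_anode=1, aux_anodes=[0])

forms = surface_forms(t_zaplatit, anodes)
```

Keeping separate reference fields for the content word and the auxiliary words mirrors the distinction the text draws between the two reference types.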
We have automatically selected 3782 sentences with the PMLTQ search query engine (Štěpánek and Pajas, 2010), using the Czech counterpart of the corpus as the reference translation. All the pre-selected sentences were included in the inputs of the MT systems participating in the WMT18 News Translation Task. In addition to the "primary" systems CUNI Transformer, UEDIN and the online systems, we also added three baseline (contrastive) systems: CUNI Chimera, CUNI Chimera noDepfix and CUNI Moses.
CUNI Transformer is a carefully trained system (Popel and Bojar, 2018) based on the Transformer architecture (Vaswani et al., 2017) and thus without recurrent connections.
UEDIN is an ensemble of deep RNN systems translating left-to-right and reranked by a deep right-to-left RNN model. CUNI Moses serves as the ultimate baseline. It is phrase-based (Koehn et al., 2007) and trained on a very large parallel corpus and further adapted for the news text.
CUNI Chimera is the hybrid setup that served very well in 2013-2015 (Bojar et al., 2013). A phrase-based backbone is used to combine translations by the transfer-based system TectoMT (Žabokrtský et al., 2008), by Nematus (Sennrich et al., 2017) and by Neural Monkey (Helcl et al., 2018) with phrase pairs from the large parallel corpus. The final step of Chimera is the application of Depfix, a dependency-based automatic error correction tool (Rosa et al., 2012). In this paper we report the performance of both the full CUNI Chimera and a version without the Depfix post-correction, labelled CUNI Chimera noDepfix.
Since our sentences originally come from the WSJ section of the Penn Treebank, they belong to the domain of the translation task.

Evaluation
For each phenomenon we implemented a small test relying on an automatic analysis of the source English sentence into a surface syntactic tree (a-layer, in the terminology of PCEDT), an automatic analysis of the Czech translation into a surface (a-layer) as well as a deep (t-layer) syntactic tree, and on automatic word alignments between the English a-layer and the Czech a-layer and t-layer. We aligned English directly to each of the Czech layers; a more rigorous approach would have been to align only the a-layers and follow the links between the a-layer and the t-layer on the Czech side, but since all our annotations are automatic, we do not expect much difference between these approaches due to random errors in all processing steps. The annotation was provided by the pipeline used in the creation of the CzEng corpus (Bojar et al., 2016) as implemented in the Treex toolkit (Popel and Žabokrtský, 2010). For the alignment, we relied on an intersection of GIZA++ (Och and Ney, 2000) alignments.
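As an illustration of the intersection step, here is a minimal Python sketch. It assumes that the two directional alignments are already available as sets of index pairs; the function name and the input format are hypothetical and do not correspond to GIZA++ output files.

```python
from typing import Set, Tuple

Link = Tuple[int, int]


def intersect_alignments(en2cs: Set[Link], cs2en: Set[Link]) -> Set[Link]:
    """Keep only links proposed in both alignment directions.

    en2cs holds (english_idx, czech_idx) pairs; cs2en holds
    (czech_idx, english_idx) pairs and is flipped before intersecting.
    """
    flipped = {(en, cs) for (cs, en) in cs2en}
    return en2cs & flipped


# Toy directional alignments (invented indices, not real data).
en2cs = {(0, 0), (1, 1), (2, 1), (3, 2)}
cs2en = {(0, 0), (1, 1), (2, 3), (2, 4)}

links = intersect_alignments(en2cs, cs2en)
```

The intersection favours precision over recall: a link survives only if both directions agree, so some English keywords end up with no Czech counterpart at all.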
The test searched for the keyword related to the phenomenon (e.g. the controlled English verb), followed the word-alignment links to the Czech side, and tested some morphological or syntactic properties of the corresponding Czech word or t-layer node. The result of the test was "Good" if the Czech expression was the expected, most straightforward translation, "Bad" otherwise, and "Unknown" if the target word or node was not found, e.g. due to errors in word alignment.
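The three-way outcome of such a test can be sketched in Python as follows. This is only a simplified illustration, not the Treex-based implementation: the function names, the dictionary-based node representation and the "verb_form" attribute are all assumptions made for the example.

```python
from typing import Callable, Dict, List

GOOD, BAD, UNKNOWN = "Good", "Bad", "Unknown"


def run_test(keyword_idx: int,
             alignment: Dict[int, int],
             czech_nodes: List[dict],
             is_expected: Callable[[dict], bool]) -> str:
    """Follow the alignment link from the English keyword and check the Czech node."""
    target_idx = alignment.get(keyword_idx)
    if target_idx is None or not 0 <= target_idx < len(czech_nodes):
        return UNKNOWN  # no usable Czech counterpart, e.g. an alignment error
    return GOOD if is_expected(czech_nodes[target_idx]) else BAD


def is_finite_verb(node: dict) -> bool:
    """Hypothetical property check: expect a finite verb on the Czech side."""
    return node.get("verb_form") == "finite"


# Example (2): "calling" is rendered by the finite verb "nazval".
result_calling = run_test(0, {0: 0},
                          [{"form": "nazval", "verb_form": "finite"}],
                          is_finite_verb)
# Example (3): "Avoiding" is rendered by the infinitive "Vyhnout".
result_avoiding = run_test(0, {0: 0},
                           [{"form": "Vyhnout", "verb_form": "infinitive"}],
                           is_finite_verb)
# A keyword without any alignment link.
result_unaligned = run_test(0, {},
                            [{"form": "nazval", "verb_form": "finite"}],
                            is_finite_verb)
```

In this sketch, example (3) counts as "Bad" for a test that expects a finite clause even though the infinitive is a perfectly acceptable translation.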
It is important to note that "Bad" does not always mean an unacceptable translation. It merely means that the translation is not the most straightforward one. Table 1 below presents the detailed results of these tests.
While the manual evaluation of WMT18 systems is not yet available, we can assume that it will match the automatic evaluation available at http://matrix.statmt.org/matrix/systems_list/1883 and reproduced here in Table 2. One caveat to keep in mind is that this evaluation is based on a different set of sentences than we use in our testsuite.
Disregarding the "Unknowns", we plot the results in Figure 2 and Figure 3 by systems and by phenomena, respectively.

Discussion
One observation is that the reference generally adopts the most typical translation in all the phenomena. (A small exception is the performance of UEDIN in EN-gerund-CS-finclause on the refined set; not confirmed on the larger set though.) At the same time, the reference does not always match our expectation. The most divergent phenomenon is EN-gerund-CS-finclause where the reference uses the expected finite clause only in 56/(56 + 25.3) = 68.9% of cases.
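The share quoted above disregards the "Unknown" outcomes. A short Python sketch of that computation follows; the function names are hypothetical, and the inputs may be either raw counts or percentages, since only their ratio matters.

```python
from collections import Counter
from typing import Iterable, Optional


def share_of_good(good: float, bad: float) -> Optional[float]:
    """Proportion of "Good" among the known outcomes ("Good" + "Bad")."""
    known = good + bad
    if known == 0:
        return None  # nothing to evaluate once the unknowns are set aside
    return good / known


def share_from_labels(labels: Iterable[str]) -> Optional[float]:
    """Same computation, starting from per-sentence test outcomes."""
    counts = Counter(labels)
    return share_of_good(counts["Good"], counts["Bad"])


# Figures quoted in the text for the reference on EN-gerund-CS-finclause.
reference_share = share_of_good(56, 25.3)

# Toy per-sentence outcomes (invented for illustration).
toy_share = share_from_labels(["Good", "Good", "Bad", "Unknown"])
```

With the figures from the text, reference_share comes out at roughly 0.689, which is the 68.9% reported above.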
In general, the reference seems a little harder to process ("Unk" higher than for MT systems), probably due to a less verbatim translation and thus a less straightforward word alignment.
The order of MT systems does not match their overall automatic performance. CUNI Transformer, the best-performing system overall (and a system that is actually likely to surpass humans this year in sentence-level evaluation), appears in the middle of our list. This suggests that Transformer outputs may be "more creative", departing more from the reference. UEDIN, on the other hand, seems to be very close to the reference in the studied phenomena. Finally, phrase-based Moses has been clearly surpassed in all evaluations. As a next step, we plan to obtain and compare manual evaluations of individual sentences. The annotators will rate the automatic and reference translations alike, without knowing which is which. For each system, we will compare the correlation between the quality rating and the agreement with the reference translation.
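The planned comparison could be sketched as follows. The text does not say which correlation coefficient will be used, so this sketch assumes the Pearson coefficient, computed between a per-sentence quality rating and a binary indicator of agreement with the reference. The function name and all data values are invented for illustration.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data for one system: manual quality ratings per sentence and
# whether the test outcome agreed with the reference (1) or not (0).
ratings = [80.0, 60.0, 90.0, 40.0, 75.0]
agrees_with_reference = [1, 0, 1, 0, 1]

r = pearson(ratings, agrees_with_reference)
```

A clearly positive value for a system would indicate that its sentences rated higher by annotators also tend to handle the phenomenon the way the reference does.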

Conclusion
We have presented a testsuite focused on English-Czech translation of a small set of extremely frequent verb-related phenomena. The testsuite of about 3000 automatically preselected sentences reveals that the two overall top-performing systems, UEDIN and CUNI Transformer, differ considerably in their handling of the phenomena. Further investigation, especially in connection with the manual annotation which is now running for WMT18, is needed to validate whether the less expected translations for our selected phenomena reflect the assessed translation quality.
The dataset is publicly accessible via the LINDAT-CLARIN repository.
Figure 3: Performance by linguistic phenomena on the refined set. Each facet represents one pair of En phenomenon - Cs translation option. Each bar represents one MT system. The result is computed as the proportion of agreements of the given MT system with the reference in the total number of cases (x-scale). In addition, we display the exact number inside each bar.