Automatically Extracting Challenge Sets for Non-Local Phenomena in Neural Machine Translation

We show that the state-of-the-art Transformer MT model is not biased towards monotonic reordering (unlike previous recurrent neural network models), but that nevertheless, long-distance dependencies remain a challenge for the model. Since most dependencies are short-distance, common evaluation metrics will be little influenced by how well systems perform on them. We therefore propose an automatic approach for extracting challenge sets rich with long-distance dependencies, and argue that evaluation using this methodology provides a complementary perspective on system performance. To support our claim, we compile challenge sets for English-German and German-English, which are much larger than any previously released challenge set for MT. The extracted sets are large enough to allow reliable automatic evaluation, which makes the proposed approach a scalable and practical solution for evaluating MT performance on the long-tail of syntactic phenomena.


Introduction
The assumption that proximate source words are more likely to correspond to proximate target words has often been introduced as a bias (henceforth, locality bias) into statistical MT systems (Brown et al., 1993; Koehn et al., 2003; Chiang, 2005). While reordering phenomena, abundant for some language pairs, violate this simplifying assumption, it has often proved to be a useful inductive bias in practice, especially when complemented with targeted techniques for addressing non-monotonic translation (e.g., Och, 2002; Chiang, 2005). For example, if an adjective precedes a noun in one language and modifies it syntactically, it is likely that their corresponding words will appear close to each other in the translation; i.e., they may not be immediately adjacent or even in the same order in the translation, but it is unlikely that they will be arbitrarily distant from one another.
In the era of Neural Machine Translation (NMT), such biases are implicitly introduced by the sequential nature of the LSTM architecture (Bahdanau et al., 2015, see §2). The influential Transformer model (Vaswani et al., 2017) replaces the sequential LSTMs with self-attention, which does not seem to possess this bias. We show that the default implementation of the Transformer does retain some bias, but that it can be removed by using learned positional embeddings (§3).
Long-distance dependencies (LDD) between words and phrases present a long-standing problem for MT (Sennrich, 2016), as they are generally more difficult to detect (indeed, they pose an ongoing challenge for parsing as well (Xu et al., 2009)), and often result in non-monotonic translation if the target differs from the source in terms of its word order and lexicalization patterns. The Transformer's indifference to the absolute position of the tokens raises the question of whether long-distance dependencies are still an open problem.
We address this question by proposing an automatic method to compile challenge sets for evaluating system performance on LDD (§4). We distinguish between two main LDD types: (1) reordering LDD, namely cases where source and target words largely correspond to one another but are ordered differently; (2) lexical LDD, where the way a word or a contiguous expression on the target side is translated is dependent on non-adjacent words on the source side.
We define a methodology for extracting both LDD types. For reordering LDD, we build on Birch (2011), whereas for lexical LDD we compile a list of linguistic phenomena that yield LDD, and use a dependency parser to find instances of these phenomena in the source side of a parallel corpus. As a test case, we apply this method to construct challenge sets (§4.2) for German-English and English-German. The approach can be easily scaled to other languages for which a good enough parser exists. Experimenting with both RNN and self-attention NMT architectures, we find that although the latter presents no locality bias, LDD remain challenging. Moreover, lexical LDD become increasingly challenging with their distance, suggesting that syntactic distance remains an important determinant of performance in state-of-the-art (SoTA) NMT.
We conclude that evaluating LDD using targeted challenge sets gives a detailed picture of MT performance, and underscores challenges the field has yet to fully address. As particular types of LDD are not frequent enough to significantly affect coarse-grained measures, such as BLEU (Papineni et al., 2002) or TER (Snover et al., 2006), our evaluation approach provides a complementary perspective on system performance.

Long-distance Dependencies in MT
A common architecture for text-to-text generation tasks is the (Bi)LSTM encoder-decoder (Bahdanau et al., 2015). This architecture consists of several LSTM layers for the encoder and the decoder and a thin attention layer connecting them. LSTM is a recurrent network with a state vector it updates. At every step, it discards some of the current and past information and aggregates the rest into the state. Any information about the past comes from this state, which is a learned "summary" of the previous states (cf. Greff et al., 2017). Hence, for information to reach a certain prediction step, it should be stored and then kept throughout the intermediate steps (tokens). While theoretically information could be kept indefinitely (Hochreiter and Schmidhuber, 1997), practical evidence shows that LSTM performance decreases with the distance between the trigger and the prediction (Linzen et al., 2016), and that LSTMs have difficulties generalizing over sequence lengths (Suzgun et al., 2018).
Despite being affected by absolute distances between syntactically dependent tokens (Linzen et al., 2016), LSTMs tend to learn to a certain extent structural information even without being instructed to do so explicitly (Gulordava et al., 2018). Futrell and Levy (2018) discuss similar linguistic phenomena to what we discuss in §4.2, and show that LSTM encoder-decoder systems handle them better than previous N-gram based systems, despite being profoundly affected by distance.
Transformer (Vaswani et al., 2017) models are also encoder-decoder, but instead of LSTMs, they use self-attention. Self-attention is based on gating all outputs of the previous layer as inputs for the current one; put differently, it aggregates all the input in one step. This approach makes information from all parts of the input sequence equally reachable. While this is not the only architecture with such attributes (van den Oord et al., 2016), we focus on it due to its SoTA results for MT (Lakew et al., 2018). The Transformer's use of self-attention inspired other works in related fields (Devlin et al., 2018), some of which attributed their performance gains to the model's ability to capture long-range context (Müller et al., 2018).
As the Transformer does not aggregate input sequentially, token positions must be represented through other means. For that purpose, the embedding of each input token W is concatenated with an embedding of its position in the source sentence P. While positional embeddings can generally be any vectors, two implementations are commonly used (Tebbifakhr et al., 2018; Guo et al., 2018): learned positional embeddings (learnedPEs; P is randomly initialized), and sine positional embeddings (SinePEs) defined as:

$$P_{(pos,\,2i)} = \sin\left(pos / 10000^{2i/dim}\right)$$
$$P_{(pos,\,2i+1)} = \cos\left(pos / 10000^{2i/dim}\right)$$

where dim is the dimension of the embedding. Vaswani et al. (2017) report that they see no benefit in learnedPEs, and hence use SinePEs, which have much fewer parameters.
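As an illustration of the SinePE definition, the following minimal sketch computes the embedding vector of a single position. The function name `sine_positional_embedding` and the `base` parameter are our own naming for this example, and an even embedding dimension is assumed.

```python
import math

def sine_positional_embedding(pos, dim, base=10000.0):
    """Return the SinePE vector for position `pos` (assumes `dim` is even)."""
    vec = [0.0] * dim
    for i in range(dim // 2):
        # Both members of the (2i, 2i+1) pair share the same angle.
        angle = pos / (base ** (2 * i / dim))
        vec[2 * i] = math.sin(angle)       # even index: sine
        vec[2 * i + 1] = math.cos(angle)   # odd index: cosine
    return vec

pe0 = sine_positional_embedding(0, 8)
pe3 = sine_positional_embedding(3, 8)
```

Because the vectors are a fixed function of the position, they add no trainable parameters, in contrast to learnedPEs, which need one learned vector per position.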
Most of the dependencies between words are short. Short-distance linguistic dependencies include some of the most common phenomena in language, such as determination, modification by an adjective and compounding. For example, 62% of the dependencies in the standard UD EWT training set (Silveira et al., 2014) are between tokens that are up to one word apart. It stands to reason that the locality bias is useful in these cases. Nevertheless, as system quality improves, rarer, more challenging dependencies become a priority, and languages present a countless number of long-distance reordering phenomena (Deng and Xue, 2017). One example is subject-verb agreement, where a correct translation requires that the verb is inflected according to the headword of the subject (e.g., in English "dogs that ..., bark", while "a dog that ..., barks"). When translating such cases, a locality bias may impede performance, by biasing the model not to attend to both the subject's head and the main verb (which may be arbitrarily distant), thereby preventing it from correctly inflecting the main verb.
Due to the benefits of the locality bias, it featured prominently in statistical MT, including in the IBM models, where alignments are constrained not to cross too much (Brown et al., 1993), and in predicting probabilities of reorderings (Koehn et al., 2003; Chiang, 2005). Difficulties in handling LDD have motivated the development of syntax-based MT (Yamada and Knight, 2001), which can effectively represent reordering at the phrase level, such as when translating between VSO and SOV languages. However, syntax-based MT models remain limited in their ability to map between arbitrarily different word orders (Sun et al., 2009; Xiong et al., 2012). For example, reorderings that violate the assumption that the trees form contiguous phrases would be difficult for most such models to capture. In the next section (§3) we show that the Transformer, when implemented with learnedPEs, presents no locality bias, and hence can, in principle, learn dependencies between any two positions of the source, and use them at any step during decoding.

MT Evaluation
With major improvements in system performance, crude assessments of performance are becoming less satisfying, i.e., evaluation metrics do not give an indication of the performance of MT systems on important challenges for the field (Isabelle and Kuhn, 2018). String-similarity metrics against a reference are known to capture only partial and coarse-grained aspects of the task (Callison-Burch et al., 2006), but are still the common practice in various text generation tasks. However, their opaqueness and difficulty to interpret have led to efforts to improve evaluation measures so that they will better reflect the requirements of the task (Anderson et al., 2016; Sulem et al., 2018; Choshen and Abend, 2018b), and to increased interest in defining more interpretable and telling measures (Lo and Wu, 2011; Hodosh et al., 2013; Choshen and Abend, 2018a).
A promising path forward is complementing string-similarity evaluation with linguistically meaningful challenge sets. Such sets have the advantage of being interpretable: they test for specific phenomena that are important for humans and are crucial for language understanding. Interpretability also means that evaluation artefacts are more likely to be detected earlier. So far, such challenge sets were constructed for French-English (Isabelle et al., 2017; Isabelle and Kuhn, 2018) and English-Swedish (Ahrenberg, 2018). Previous challenge sets were compiled by manually searching corpora for specific phenomena of interest (e.g., yes-no questions which are formulated differently in English and French). These corpora are carefully made but are small in size (ten examples per phenomenon), which means that evaluation must be done manually as well.
As our methodology extracts sentences automatically based on parser output, we are able to compile much larger challenge sets, which allows us to apply standard MT measures to each subcorpus corresponding to a specific phenomenon. The methodology is, therefore, more flexible, and can be straightforwardly adapted to accommodate future advances in MT evaluation.

Locality in SoTA NMT
In this section we show that encoder-decoder models based on BiLSTM with attention (see §2), do exhibit a locality bias, but that the Transformer, whose encoder is based on self-attention, and in which token position is encoded only through learnedPEs, does not present any such bias.

Methodology
In order to test whether an NMT system presents a locality bias in a controlled environment, we examine a setting of arbitrary absolute order of the source-side tokens. In this case, systems that are predisposed towards monotonic decoding are likely to present lower performance, while systems that have no predisposition as to the order of the target-side tokens relative to the source-side tokens are not expected to show any change in performance. In order to create a controlled setting, where source-side token order is arbitrary, we extract fixed-length sentences, and apply the same permutation to all of them. We then train systems with the permuted source-side data (and the same target-side data), and compare results to a control condition where no permutation is applied.
Concretely, we experiment on a German-English setting, extracting all sentences of the most common length (18) from the WMT2015 (Bojar et al., 2015) training data. This results in 130,983 sentences, of which we hold out 1,000 sentences for testing. It is comparable in training set size to a low-resource language setting.
We set a fixed permutation σ : [18] → [18] and train systems on four versions of the training data (settings): (1) REGULAR, to be used for control; (2) PERMUTED source-side, in which we apply σ over all source-side tokens; (3) PERPOSEMB, where the positional embeddings of the source-side tokens are permuted; and (4) REVERSED, where tokens are input in reverse order.
We apply the following permutation, σ, to the source-side tokens: We did not find any property that would deem this permutation special (examining, e.g., its decomposition into cycles). We therefore assume that similar results will hold for other σs as well.
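The PERMUTED and REVERSED settings can be sketched as follows. This is only an illustration under our own assumptions: the permutation is drawn at random from an arbitrary seed and is not the σ used in the experiments, and the helper names (`make_permutation`, `permute`, `reverse`) are hypothetical.

```python
import random

SENT_LEN = 18  # all extracted sentences share this length

def make_permutation(n, seed=0):
    """Draw one fixed permutation of range(n); the seed is an arbitrary choice."""
    rng = random.Random(seed)
    sigma = list(range(n))
    rng.shuffle(sigma)
    return sigma

def permute(tokens, sigma):
    """PERMUTED: output position i receives the token at source position sigma[i]."""
    if len(tokens) != len(sigma):
        raise ValueError("sentence length must match the permutation length")
    return [tokens[j] for j in sigma]

def reverse(tokens):
    """REVERSED: feed the source tokens in reverse order."""
    return tokens[::-1]

sigma = make_permutation(SENT_LEN)
sentence = [f"w{i}" for i in range(SENT_LEN)]
permuted = permute(sentence, sigma)
reversed_sentence = reverse(sentence)
```

The key property is that the same `sigma` is applied to every source sentence, so the reordering is arbitrary with respect to linguistic structure but fully consistent across the training and test data.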
We train a Transformer model, optimizing using Adam (Kingma and Ba, 2015). We use an embedding size of 512, a dropout rate of 0.1, 6 stacked layers in both the encoder and the decoder, and 8 attention heads. We use tokenization, truecasing and BPE as preprocessing, following the same protocol as Yang et al. (2018).
We experiment both with learnedPEs, and with SinePEs. We train the BiLSTM model using the Nematus implementation (Sennrich et al., 2017b), and use their supplied scripts for preprocessing, training and testing, changing only the datasets used. For all models, we report the highest BLEU score on the test data for any epoch during training, and perform early stopping after 10 consecutive epochs without improvement.

Table 1 (columns: Model, Positional Setting, BLEU; the table body was not recovered).
Repetitions for the other settings: 5 for PERMUTED, 1 for PERPOSEMB and 1 for REVERSED. In addition, we trained the BiLSTM model and the Transformer with SinePEs both in the REGULAR condition and in PERMUTED; each was trained once.

Results
Table 1 presents our results. We find that Nematus BiLSTM suffers substantially from permuting the source-side tokens, but that the Transformer does not exhibit a locality bias. Indeed, for learnedPEs in all settings (REGULAR, PERMUTED, REVERSED and PERPOSEMB), BLEU scores are essentially the same. We also find that the common practice of using fixed SinePEs does introduce some bias, as attested by the small performance drop between REGULAR and PERMUTED. Like Vaswani et al. (2017), we find that in the REGULAR settings, learnedPEs are not superior in performance to SinePEs, despite having more expressive power. However, our results suggest that the decision between learnedPEs and SinePEs is not without consequences: learnedPEs are preferable if a locality bias is undesired (this is potentially the case for highly divergent language pairs).

Discussion
Finding that Transformers do not present a locality bias has implications for how to construct their input in MT settings, as well as in other tasks that use self-attention encoders, such as image captioning (You et al., 2016). It is common practice to augment the source side with globally applicable information, e.g., the target language in multilingual MT (Johnson et al., 2017). Having no locality bias implies this additional information can be added at any fixed point in the sequence fed to a Transformer, provided that the positional embeddings do not themselves introduce such a bias. This is not the case with BiLSTMs, which often require introducing the same information at each input token to allow it to be effectively used by the system (Yao et al., 2017; Rennie et al., 2017).

LDD Challenge Sets
One of the stated motivations of the Transformer model is to effectively tackle long-distance dependencies, which are "a key challenge in many sequence transduction tasks" (Vaswani et al., 2017).
Our results from the previous section show that indeed fixed reordering patterns are completely transparent for Transformers. This, however, still leaves the question of how Transformers handle linguistic reordering patterns, which may involve varying distances between dependent tokens.

Methodology
We propose a method for scalably compiling challenge sets to support fine-grained MT evaluation for different types of LDD. We address two main types: Reordering LDD are cases where the words on the two sides of the parallel corpus largely correspond to one another, but are ordered differently. These cases may require attending to source words in a highly non-monotonic order, but the generation of each target word is localized to a specific region in the source sentence. For example, in English-German, the verb in a subordinated clause appears in a final position, while the verb in the English source appears right after the subject.
Consider "The man that is sitting on the chair", and the corresponding German "Der Mann, der auf dem Stuhl sitzt" (lit. the man, that on the chair sits): while the verb is placed at different clause positions in the two cases, the words mostly have direct correspondents. Our methodology follows Birch (2011) in detecting such phenomena based on alignment. Concretely, we extract a word alignment between corresponding sentences, and collect all sentences that include a pair of aligned words in the source and target sides, whose indices have a difference of at least d ∈ N.
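The alignment-based filter can be sketched as below. The sketch assumes that each alignment is given as one line of whitespace-separated `i-j` pairs of 0-based source and target indices (the Pharaoh-style format produced by common word aligners); the function names are hypothetical and chosen for this example only.

```python
def parse_alignment(line):
    """Parse 'i-j' pairs (source index, target index) from one alignment line."""
    pairs = []
    for item in line.split():
        src, tgt = item.split("-")
        pairs.append((int(src), int(tgt)))
    return pairs

def has_reordering_ldd(pairs, d):
    """True if some aligned pair of words has indices differing by at least d."""
    return any(abs(src - tgt) >= d for src, tgt in pairs)

def extract_reordering_set(bitext, alignment_lines, d):
    """Keep the (source, target) pairs whose alignment contains a reordering LDD."""
    return [
        pair
        for pair, line in zip(bitext, alignment_lines)
        if has_reordering_ldd(parse_alignment(line), d)
    ]

bitext = [("src one", "tgt one"), ("src two", "tgt two")]
alignment_lines = ["0-0 1-6 2-2", "0-0 1-2"]
selected = extract_reordering_set(bitext, alignment_lines, d=5)
```

Only the first toy pair is kept here, since it contains an aligned pair whose indices differ by 5, while the second has a maximal index difference of 1.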
Lexical LDD are cases where the translation of a single word or phrase is determined by non-adjacent words on the source side. This requires attending to two or more regions that can be arbitrarily distant from one another. Several phenomena, such as light verbs (Isabelle and Kuhn, 2018), are known from the linguistic and MT literature to yield lexical LDD. Our methodology takes a predefined set of such phenomena, and defines rules for detecting each of them over dependency parses of the source side. See §4.2 for the list of phenomena we experiment on in this paper.
Focusing on LDD, we restrict ourselves to instances where the absolute distance between the word and the dependent is at least d ∈ N. Selecting large enough d entails that the extracted phenomena are unlikely to be memorized as a phrase with a specific meaning (e.g., encode "make the whole thing up" [d = 3] as a phrase, rather than as a discontiguous phrase "make ... up" with an argument "the whole thing"). This increases the probability that such cases, if translated correctly, reflect the MT systems' ability to recognize that such discontiguous units are likely to be translated as a single piece.
We note, that by extracting the challenge set based on syntactic parses, we by no means assume these representations are internally represented by the MT systems in any way, or assume such a representation is required for succeeding in correctly translating such constructions. The extraction method is merely a way of finding phenomena we have a reason to believe are difficult to translate, and meaningful for language understanding. We use Universal Dependencies (UD; Nivre et al., 2016) as a syntactic representation, due to its cross-lingual consistency (about 90 languages are supported so far), which allows research on difficult LDD phenomena that recur across languages.
Our extraction methods resemble previous challenge set approaches (Isabelle et al., 2017; Isabelle and Kuhn, 2018; Ahrenberg, 2018), in using linguistically motivated sets of sentence pairs to assess translation quality. However, as our extraction method is fully automatic, it allows for the compilation of much larger challenge sets over many language pairs. The challenge sets we extract contain hundreds or thousands of pairs (§4.2). The size of the sets allows using any MT evaluation measures to measure performance, and is thus a much more scalable solution than manual inspection, as is commonly done in challenge set approaches.
On the other hand, an automatic methodology has the side-effect of being noisier, and not necessarily selecting the most representative sentences for each phenomenon. For instance befinden sich (lit. to determine) includes a verb and a reflexive pronoun, which do not necessarily appear contiguously in German. However, as befinden always appears with the reflexive sich, it might not pose a challenge to NMT systems, which can essentially ignore the reflexive pronoun upon translation.

A Test Case on Extracting Sets
Next, we discuss the compilation of German-English and English-German corpora. We select these pairs, as they are among the most studied in MT, and comparatively high results are obtained for them (Bojar et al., 2017). Hence, they are more likely to benefit from a fine-grained analysis.
For the reordering LDD corpus, we align each pair of source and target sentences using FastAlign (Dyer et al., 2013) and collect all sentences with at least one pair of source-side and target-side tokens whose indices have a difference of at least d = 5. For example:
Source: Wäre es ein großer Misserfolg, nicht den Titel in der Ligue 1 zu gewinnen, wie dies in der letzten Saison der Fall war?
Gloss: Would-be it a big failure, not the title in the Ligue 1 to win, as this in the last season the case was?
Target: In Ligue 1, would not winning the title, like last season, be a big failure?
We extract lexical LDD using simple rules over source-side parse trees, parsed with UDPipe (Straka and Straková, 2017). For a sentence to be selected, at least one word should separate the detected pair of words. We picked several well-known challenging constructions for translation that involve discontiguous phrases: reflexive verbs, verb-particle constructions and preposition stranding. We note that while these constructions often yield lexical LDDs, and are thus expected to be challenging on average, some of their instances can be translated literally (e.g., amuse oneself is translated to amüsieren sich).
Reflexive Verbs. Prototypically, reflexivity is the case where the subject and object corefer. Reflexive pronouns in English end with self or selves (e.g., yourselves) and in German include sich, dich, mich and uns among others. However, reflexive pronouns can often change the meaning of a verb unpredictably, and may thus lead to different translations for non-reflexive instances of a verb, compared to reflexive ones. For example, abheben in German means taking off (as of a plane), but sich abheben means standing out. Similarly, in the example below, drängte sich translates to intrude, while drängte normally translates to pushed.
A source sentence is said to include a reflexive verb if one of its tokens is parsed with a reflexive morphological feature (refl=yes).
Phrasal Verbs. Phrasal verbs are made up of a verb and a particle (or several particles), which may change the meaning of the verb unpredictably. Examples of English phrasal verbs include run into (in the sense of meet) and give in, and in German they include examples such as einladen (invite), consisting morphologically of the particle ein and the verb laden (load). A source sentence is said to include a phrasal verb if a particle dependent (UD labels of compound:prt or prt) exists in the parse. For example, trat in itself means stepped, but in the extracted example below, trat ... entgegen translates to received.
Preposition Stranding. Preposition stranding is the case where a preposition does not appear adjacent to the object it refers to. In English, it will often appear at the end of the sentence or a clause. For example, The banana she stepped on or The boy I read the book to. Preposition stranding is common in English and other languages such as Scandinavian languages or Dutch (Hornstein and Weinberg, 1981). However, in German, it is not a part of standard written language (Beermann and Ik-Han, 2005), although it does (rarely) appear (Fanselow, 1983). We, therefore, extract this challenge set only with English as the source side. While preposition stranding is often regarded as a syntactic phenomenon, we consider it here a lexical LDD, since the translation of prepositions (and in some cases their accompanying verbs) is dependent on the prepositional object, which in the case of preposition stranding, may be distant from the preposition itself. For example, translating the car we looked for into German usually uses the verb suchen (search), while translating the car we looked at does not. Translating prepositions is difficult in general (Hashemi and Hwa, 2014), but preposition stranding is especially so, as there is no adjacent object to assist disambiguation. A source sentence is said to include preposition stranding if it contains two nodes with an edge of the type obl (oblique) or a subcategory thereof between them, and the UD POS tag of the dependent is adposition (ADP).
Table 3: Sizes of Lexical LDD corpora. Challenge sets are partitioned (in order of appearance) by the language pairs, the phenomenon type, and the minimal distance between the head and the dependent. Phenomenon appears in the source. Statistics for the Newstest2013 corpora with minimal distance ≥ 1 are at the rightmost column, the rest are on Books.
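The three detection rules can be sketched over a simplified token representation as follows. Several choices here are assumptions made for illustration only: the `Token` tuple stands in for a real CoNLL-U reader, the reflexive feature is checked under its UD v2 spelling (`Reflex=Yes`), the detected pair is taken to be a token and its syntactic head, and the distance d is counted as the number of words separating them.

```python
from collections import namedtuple

# One parsed token: 1-based id, word form, UD POS tag, head id (0 = root),
# dependency label, and a dict of morphological features.
Token = namedtuple("Token", "id form upos head deprel feats")

def gap(tok):
    """Number of words separating a token from its head (0 = adjacent)."""
    return abs(tok.id - tok.head) - 1

def is_reflexive(tok):
    # Assumption: UD v2 spelling of the reflexive feature.
    return tok.feats.get("Reflex") == "Yes"

def is_particle(tok):
    return tok.deprel in ("compound:prt", "prt")

def is_stranded_preposition(tok):
    # An obl edge (or a subtype such as obl:xyz) whose dependent is an adposition.
    is_obl = tok.deprel == "obl" or tok.deprel.startswith("obl:")
    return is_obl and tok.upos == "ADP"

RULES = {
    "reflexive_verb": is_reflexive,
    "phrasal_verb": is_particle,
    "preposition_stranding": is_stranded_preposition,
}

def detect(sentence, d=1):
    """Return the phenomena found with at least d words between the pair."""
    found = set()
    for tok in sentence:
        if tok.head == 0 or gap(tok) < d:
            continue  # skip the root and pairs that are too close
        found.update(name for name, rule in RULES.items() if rule(tok))
    return found

# Hand-written parse of "make the whole thing up" (not parser output).
example = [
    Token(1, "make", "VERB", 0, "root", {}),
    Token(2, "the", "DET", 4, "det", {}),
    Token(3, "whole", "ADJ", 4, "amod", {}),
    Token(4, "thing", "NOUN", 1, "obj", {}),
    Token(5, "up", "ADP", 1, "compound:prt", {}),
]
phenomena = detect(example, d=3)
```

Calling `detect` with increasing values of `d` yields progressively smaller subsets of the same phenomenon, which is one way to slice a challenge set by dependency length.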

Experiments
We turn to evaluate SoTA NMT performance on the extracted challenge sets.
Experimental Setup. We trained the Transformer on WMT2015 training data (Bojar et al., 2015); for parameters see §3.1. For Nematus we used the non-ensemble pre-trained model from Sennrich et al. (2017a). Each test set, whether a baseline or a challenge set, was capped at 10k sentences for the Transformer and 1k sentences for Nematus. Two parallel corpora were used for extracting the challenge sets. One is newstest2013 (Bojar et al., 2015) from the news domain that is commonly used as a development set for English-German. The other is the relatively unused Books corpus (Tiedemann, 2012) from the more challenging domain of literary translation. The corpora are of sizes 3K and 51K respectively. For lexical LDD, we took the distance (d) between the relevant words to be at least 1, meaning there is at least one word separating them. See Tables 2 and 3 for the sizes of the extracted corpora.
For evaluation, we use the MOSES implementation of BLEU (Papineni et al., 2002; Koehn et al., 2007), and for reordering LDD, also RIBES (Isozaki et al., 2010), which focuses on reordering. RIBES measures the correlation of n-gram ranks between the output and the reference, restricted to n-grams that appear uniquely and in both.
Manual Validation. To assess the ability of our procedure to extract relevant LDDs, we manually analyzed over 180 source German sentences ex-
Results. Comparison of the overall BLEU scores of the NMT models (Table 4) against their performance on the challenge sets shows that the phenomena are challenging for both models. Both in the small development set of newstest2013 and the large set of Books, the challenge subparts are more challenging across the board. For reordering LDD, we further apply RIBES and find a similar trend: the RIBES score is lower for the reordering challenge set than for the baseline (see Table 6).
In order to confirm that the distance between the head and dependent (the "length" of the dependency) is related to the observed performance drop in the case of lexical LDD, we partition each of the challenge sets according to their length (d), and compare the results to a control condition, where all instances of the phenomena listed in §4.2 are extracted, including non-LDD instances, i.e., sentences where the head and the dependent are adjacent. System performance on the sliced challenge sets (Table 5) shows that performance indeed decreases with d. Results thus indicate that it is not only the presence of the phenomena that makes these sets challenging, but that the challenge increases with the distance.
We validate this main finding using manual annotation of German to English cases. Using two annotators (with high agreement between them; κ=0.79), we find that the decrease in performance with d is replicated. We measure how many of the detected lexical LDD are correctly translated, ignoring the rest of the source and output, as done in manual challenge set approaches. We find that 60%, 54% and 38% of the cases are translated correctly for d ∈ (1, 2, 5), respectively. This suggests that the extracted phenomena and the distance indeed pose a challenge, and that the automatic metric we use shows the correct trend in these cases. See Appendix 2 for details.
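For readers unfamiliar with the agreement statistic, the sketch below computes chance-corrected agreement between two annotators, assuming that κ denotes Cohen's kappa. The labels are toy data made up for the example (1 = translated correctly, 0 = not) and are not the annotations collected for this paper.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance, from each annotator's label distribution.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators always use the same single label
    return (observed - expected) / (1 - expected)

# Toy judgments by two annotators over four outputs.
annotator_1 = [1, 1, 0, 0]
annotator_2 = [1, 0, 0, 0]
kappa = cohen_kappa(annotator_1, annotator_2)
```

On this toy data the annotators agree on three of four items, and half of the agreement is expected by chance, so the statistic is well below the raw agreement rate.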
Discussion. Interestingly, these results hold true for the Transformer despite its indifference to the absolute word order. Therefore, word distance in itself is not what makes such phenomena challenging, contrary to what one might expect from the definition of LDD. It seems then that these phenomena are especially challenging due to the non-standard linguistic structure (e.g., syntactic and lexical structure), and the varying distances in which LDD manifest themselves. The models, therefore, seem to be unable to learn the linguistic structure underlying these phenomena, which may motivate more explicit modelling of linguistic biases into NMT models, as proposed by, e.g., Eriguchi et al. (2017) and Song et al. (2019).
We note that our experiments were not designed to compare the performance of BiLSTM and self-attention models. We, therefore, do not see the Transformer's inferior performance on Books, relative to Nematus, as an indication of the general ability of this model in out-of-domain settings. What is evident from the results is that translating Books is a challenge in itself, probably due to the register of the language, and the presence of frequent non-literal translations.
A potential confound is that performance might change with the length of the source in BiLSTMs (Carpuat et al., 2013; Murray and Chiang, 2018), while in Transformers it was reported to increase (Zhang et al., 2018). Length is generally greater in the challenge set than in the full test set, and generally increases with d, showing, if anything, a decrease of performance by length. To assess whether our corpora are challenging due to a length bias, we randomly sample from Books 1,000 corpora with 1,000, 100 and 10 sentences each. The correlation between their corresponding average length and the Transformer's BLEU score on them was 0.06, 0.09 and 0.03 respectively. While this suggests length is not a strong predictor of performance, we conduct another experiment to verify that difficulty is not a result of the distribution of lengths in the challenge sets.
For each challenge set and each value of d (0-3), we sample 100 corpora. For each sentence in a given challenge set, we sample a sentence whose length differs from it by at most 1. This results in a corpus with a similar length distribution, but sampled from the overall population of Books sentences. Results show that the BLEU score of the challenge sets in all German to English cases is lower than that of any randomly sampled corpus. (Most sampled corpora actually had better scores than the baseline. We believe this is because very short sentences, which are mostly noise, are never sampled.) In the English-German cases, trends are similar, albeit less pronounced. This may be due to the low number of long English sentences, which leads to more homogeneous samples. Overall, results suggest that length is extremely unlikely to be the only cause for the observed trends.
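The length-matched control can be sketched as follows. This is an illustration under our own assumptions: sentence length is measured in whitespace-separated tokens, candidates are drawn with replacement, and the function name `length_matched_sample` and the toy pool are hypothetical.

```python
import random
from collections import defaultdict

def length_matched_sample(challenge_lengths, pool, rng, tol=1):
    """For each challenge-set length, draw a pool sentence within `tol` tokens of it."""
    by_len = defaultdict(list)
    for sent in pool:
        by_len[len(sent.split())].append(sent)
    sample = []
    for n in challenge_lengths:
        candidates = [
            s for k in range(n - tol, n + tol + 1) for s in by_len.get(k, [])
        ]
        if candidates:  # lengths with no match in the pool are skipped
            sample.append(rng.choice(candidates))
    return sample

pool = ["a b", "a b c", "a b c d", "a b c d e f g"]
challenge_lengths = [3, 7]
sample = length_matched_sample(challenge_lengths, pool, random.Random(0))
```

Repeating the draw many times gives a distribution of scores for corpora that share the length profile of a challenge set but not its phenomena, against which the challenge-set score can be compared.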

Conclusion
As NMT system performance is constantly improving, more reliable methods for identifying and classifying their failures are needed. Much research effort is therefore devoted to developing more fine-grained and interpretable evaluation methods, including challenge-set approaches. In this paper, we showed that, using a UD parser, it is possible to extract challenge sets that are large enough to allow scalable MT evaluation of important and challenging phenomena.
An accumulating body of research is devoted to the ability of modern neural architectures such as LSTMs (Linzen et al., 2016) and pretrained embeddings (Hewitt and Manning, 2019; Liu et al., 2019; Jawahar et al., 2019) to represent linguistic features. This paper makes a contribution to this literature in confirming that the Transformer model can indeed be made indifferent to the absolute order of the words, but also shows that this does not entail that the model can overcome the difficulties of LDD in naturalistic data. We may carefully conclude then that despite the remarkable feats of current NMT models, inducing linguistic structure in its more evasive and challenging instances is still beyond the reach of state-of-the-art NMT, which motivates exploring more linguistically-informed models.

Acknowledgments
This work was supported by the Israel Science Foundation (grant no. 929/17).

References
Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, and Jiebo Luo. 2016. Image captioning with semantic attention. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4651-4659.