The Box is in the Pen: Evaluating Commonsense Reasoning in Neural Machine Translation

Does neural machine translation yield translations that are congenial with common sense? In this paper, we present a test suite to evaluate the commonsense reasoning capability of neural machine translation. The test suite consists of three test sets, covering lexical and contextless/contextual syntactic ambiguity that requires commonsense knowledge to resolve. We manually create 1,200 triples, each of which contains a source sentence and two contrastive translations, involving 7 different commonsense types. Language models pretrained on large-scale corpora, such as BERT and GPT-2, achieve a commonsense reasoning accuracy lower than 72% on the target translations of this test suite. We conduct extensive experiments on the test suite to evaluate commonsense reasoning in neural machine translation and investigate factors that have an impact on this capability. Our experiments and analyses demonstrate that neural machine translation performs poorly on commonsense reasoning for the three ambiguity types in terms of both reasoning accuracy (≤ 60.1%) and reasoning consistency (≤ 31%). We will release our test suite as a machine translation commonsense reasoning testbed to promote future work in this direction.


Introduction
Sixty years ago, the pioneering machine translation researcher and linguist Bar-Hillel published his well-known argument on the non-feasibility of general-purpose fully automatic high-quality machine translation (FAHQT), due to the inevitable requirement of world knowledge to help machine translation infer correct translations for ambiguous words or linguistic structures (Bar-Hillel, 1960a). The example that Bar-Hillel uses as evidence for the need for commonsense knowledge in machine translation is "The box is in the pen", where machine translation is expected to reason about the relative sizes of "box" and "pen". Bar-Hillel also doubts that a machine, even equipped with extra-linguistic knowledge, would be able to reason with such knowledge spontaneously as human translators do (Bar-Hillel, 1960a; Macklovitch, 1995).
* Equal Contributions.
Modern natural language processing (NLP) has made tremendous progress, not only in building abundant resources to develop linguistic insights, but also in plenty of methodological practices. On the one hand, machine translation has been substantially advanced with large-scale parallel data and statistical models. Recent results even suggest that the quality of machine-generated translations is approaching that of professional human translators (Wu et al., 2016; Hassan et al., 2018). On the other hand, a wide variety of efforts have been made to examine the commonsense reasoning capability of neural models in natural language understanding, to establish commonsense reasoning challenges, or to enhance neural models in commonsense reasoning (Talmor et al., 2018; Huang et al., 2019; Sap et al., 2019b).
Given Bar-Hillel's doubts and the recent progress on machine translation and commonsense reasoning, it is natural to ask: have we solved the machine translation impasse related to commonsense reasoning? In particular, are current neural machine translation models able to learn common sense? If so, how much do they learn? Does neural machine translation acquire sufficient commonsense knowledge, and is its commonsense reasoning strong enough to generate human-level high-quality translations? A methodological discussion of the feasibility of FAHQT given the recent progress is far beyond the scope of this work. Instead, we focus on empirically analyzing the capability of state-of-the-art neural machine translation models in using extra-linguistic commonsense knowledge to resolve ambiguity at different linguistic levels and select correct translations after disambiguation.
Figure 1, Example (1): 这个 人 戴 的 表 走 了 3 分钟 。 The watch worn by this person went/walked for 3 minutes.
In order to achieve this goal, we manually build a machine translation commonsense reasoning test suite on Chinese-to-English translation with three types of commonsense-related ambiguities: lexical ambiguity, contextless syntactic ambiguity and contextual syntactic ambiguity (see Section 3.1 for more details). Examples are shown in Figure 1. With this test suite, we thoroughly evaluate the commonsense reasoning ability of state-of-the-art neural machine translation models, e.g., LSTM- and Transformer-based NMT (Bahdanau et al., 2015; Vaswani et al., 2017). We also analyze the commonsense reasoning capability with respect to commonsense knowledge types, sentence length, reasoning consistency and the size of the training data.
To the best of our knowledge, this is the first work to understand and measure the commonsense reasoning capability in neural machine translation. The contributions of this paper can be summarized as follows:
• We build a test suite to examine the ability of neural machine translation in commonsense reasoning, which provides a benchmark testbed for tracking progress in this direction.
• Based on our experiments and analyses on evaluating commonsense reasoning in NMT, we find that: 1) commonsense reasoning related to lexical ambiguity and contextual syntactic ambiguity is more difficult than contextless syntactic ambiguity; 2) although the commonsense reasoning accuracy is higher than 50%, the reasoning consistency rate is far lower than 50% (random guess).

Related work
We briefly review recent efforts related to commonsense reasoning in NLP. We refer readers to Storks et al. (2019)'s article for a thorough survey in this area.

Commonsense Datasets
According to Gunning (2018), commonsense knowledge normally consists of a general theory of how the physical world works and a basic understanding of human motives and behaviors. In recent years, a wide variety of datasets on the two kinds of commonsense knowledge have been proposed. Sap et al. (2019b) (Speer et al., 2016)) in machine reading comprehension.

Commonsense Reasoning Evaluation
With pre-trained language models, like BERT (Devlin et al., 2019), GPT-2 (Radford et al., 2019) being widely used in various NLP tasks, studies have been performed to examine the commonsense reasoning capability in pre-trained neural language models.  and Zhou et al. (2020) propose to measure the success rate of the pretrained language models in commonsense inference by calculating LM probabilities. Two sentences which are used to test commonsense inference differ only in commonsense concepts. Feldman et al. (2019) further explore unsupervised methods to generate commonsense knowledge using the world knowledge of pre-trained language models. Our commonsense reasoning evaluation resonates with these evaluation efforts.

Commonsense Knowledge and Reasoning in Machine Translation
Commonsense knowledge has long been acknowledged as an indispensable knowledge source for disambiguation in machine translation (Bar-Hillel, 1960b; Davis and Marcus, 2015). Knowledge-based machine translation (KBMT), one of the popular machine translation paradigms in the 1980s, lays much stress on extra-linguistic world knowledge in machine translation (Nirenburg, 1989). A large ontology, constructed either manually or automatically to provide world knowledge, is one of the essential components of KBMT (Knight and Luk, 1994). As data-driven machine translation, such as statistical machine translation (SMT) and neural machine translation, has become the de facto standard in machine translation, world knowledge has been less explicitly explored. Only a few studies have indirectly and partially exploited world knowledge in SMT or NMT, by incorporating linked open data resources such as DBpedia and BabelNet into SMT with modest improvements (Du et al., 2016; Srivastava et al., 2017; Moussallem et al., 2018).

Commonsense Reasoning Test Suite for Machine Translation
In this section, we discuss the design and construction of the test suite, including the rules and steps for building this test suite.

Test Suite Design
Different from commonsense reasoning in the Winograd Schema Challenge (Levesque et al., 2012) or sentence reasonability judgment (e.g., "He put a turkey into the fridge" vs. "He put an elephant into the fridge"), where commonsense reasoning normally happens in one language, commonsense reasoning in NMT can be done either in the encoding of the source language (i.e., encoding reasonable source representations) or in the decoding of the target language (i.e., producing reasonable target outputs). As it is difficult to detect whether reasonable senses are identified and encoded in the encoder, we check target outputs from the decoder to test the commonsense reasoning capability of NMT. This is the first rule that we follow to design the test suite.
In the second rule for building the test suite, we manually create source sentences with ambiguity that requires commonsense reasoning. Inspired by Schwartz and Gomez (2009) and Ovchinnikova (2012), we ground the commonsense reasoning test on two types of ambiguity: lexical and syntactic ambiguity (LA and SA), which are common in machine translation. An example in LA is the "batter" in "she put the batter in the refrigerator" (food material vs. baseball player). SA relates to structures, for instance, "I saw a man swimming on the bridge" (I was standing on the bridge vs. The man was swimming on the bridge). We further refine SA into contextless (e.g., Example (2) in Figure 1) and contextual SA (e.g., Example (3) in Figure 1). The former can be correctly interpreted by resorting to commonsense knowledge while the latter cannot be interpreted uniquely if no more context is given.
The third rule that we conform to is 1) to create two contrastive source sentences for each lexical or syntactic ambiguity point, where each source sentence corresponds to one reasonable interpretation of the ambiguity point, and 2) to provide two contrastive translations for each created source sentence. This is similar to other linguistic evaluations by contrastive examples in the MT literature (Avramidis et al., 2019; Bawden et al., 2018; Müller et al., 2018; Sennrich, 2017). The two contrastive translations have similar wordings: one is correct, and the other is incorrect in that it translates the ambiguous part into the corresponding translation of the contrastive source sentence. This translation makes sense in the contrastive sentence but not in the sentence in question. Examples of contrastive source sentences and contrastive translations for each source sentence are shown in Figures 2, 3 and 4.
Figure 2 (recovered English translations): "The main force has already launched an attack on the enemy's building." / "The main force has already launched a research on the enemy's building." / "After two years of attack, this technical problem has finally been solved."
Finally, we have hired two linguistic experts to construct ambiguous source sentences and two professional human translators to provide contrastive translations for each source sentence. We ask them to create and translate with words that are as diverse as possible. After the two experts and the two translators cross-check each other's work, an extra linguistic expert and an extra translator review and double-check the source sentences and target translations.

Lexical Ambiguity Test Set
To construct this test set, we select words from a Chinese polysemous dictionary 2 so that the selected words have multiple interpretations. We avoid selecting words that are semantically close to one another in order to maintain the diversity of the test set. We do not select words that are polysemous in Chinese but are translated into the same words in English. We prefer words that are translated into very different English words in different contexts and that require commonsense knowledge to disambiguate.
This test set contains 200 example blocks. Each block is composed of two contrastive triples $(z_1, e^r_1, e^c_1)$ and $(z_2, e^r_2, e^c_2)$. As shown in Figure 2, $z_1$ and $z_2$ are contrastive with each other as they contain the same ambiguous word with different meanings. $e^r_\cdot$ and $e^c_\cdot$ are contrastive translations, where the former is correct while the latter is not. $e^c_1$ and $e^c_2$ are wrong translations in that they incorrectly interpret the ambiguous word in the way of $e^r_2$ and $e^r_1$ respectively. A selected polysemous word is used in only one example block.
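The block layout described above can be sketched as a small data structure. This is only an illustration: the class and field names are hypothetical, the Chinese source sentences are replaced by placeholders, and the reference translation of the second triple is an assumed counterpart of the contrastive one rather than text taken from the test suite.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Triple:
    """One test instance (z, e^r, e^c)."""
    source: str       # z: ambiguous Chinese source sentence
    reference: str    # e^r: correct translation
    contrastive: str  # e^c: wrong translation that borrows the other reading


@dataclass(frozen=True)
class ExampleBlock:
    """Two contrastive triples that share a single ambiguity point."""
    first: Triple
    second: Triple

    def triples(self) -> Tuple[Triple, Triple]:
        return (self.first, self.second)


# Placeholder sources; the second reference below is an assumption.
block = ExampleBlock(
    first=Triple(
        source="<z1>",
        reference="The main force has already launched an attack on the enemy's building.",
        contrastive="The main force has already launched a research on the enemy's building.",
    ),
    second=Triple(
        source="<z2>",
        reference="After two years of research, this technical problem has finally been solved.",
        contrastive="After two years of attack, this technical problem has finally been solved.",
    ),
)
```

Keeping both triples in one object makes block-level measures, such as checking whether a model decides both triples the same way, straightforward to compute.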

Syntactic Ambiguity Test Sets
As mentioned before, we have two types of test sets for syntactic ambiguity: contextless and contextual SA. Before we construct the two test sets, we select Chinese structures that are typically ambiguous, just like PP attachment in English (e.g., "He ate the apple in the refrigerator" from Schwartz and Gomez (2009)). Feng (1995) has deeply investigated syntactic ambiguity in Chinese and has found 26 structures that tend to generate sentences with different interpretations, such as "noun phrase + de (a Chinese particle) + shi (is) + noun phrase". From them, we use 12 structures to construct contrastive examples, where the subtle differences in Chinese can be clearly detected in English after translation.
2 Download link for the Chinese polysemous dictionary
Figure 3 (excerpt): $z_1$ 维修 桌子 的 桌脚 。 $e^r_1$ Repair the legs of the
With these 12 structure templates with potential syntactic ambiguity, we manually create 225 example blocks for the contextless SA test set and 175 blocks for the contextual SA test set. Examples of these two test sets are listed in Figures 3 and 4. Similar to the LA test set, each block is composed of two contrastive triples, where the two translations for each source sentence are contrastive with each other in the same way as in the LA test set. For the blocks in the contextless test set, we make sure that each ambiguous source sentence can be correctly interpreted with commonsense knowledge, so no extra context information is needed for disambiguation.

Commonsense knowledge types (recovered table rows: type, description, example, listed value):
• Behaviors that objects will take in a particular situation. Example: 鸡/chicken 不/not 吃了/eat 因为/because 这只鸡/the chicken 已经/had already 吃了/eat 太多了/too much. Value: 25.2
• Taxonomy: Systematic classification of objects and concepts. Example: 今年/this year 风调雨顺/weather is good 农民的秋景/the harvest of the farmers' autumn 一定/must be 很好/very good. Value: 21.1
• Action: Some actions an object may be involved in. Example: 健康的/healthy 医生/doctor 正在/is doing 手术/surgery.
• Structures: Object A is part of Object B. Example: 削/Cut 西瓜的/the watermelon 皮/skin.

Test Suite Analysis
We provide statistical analyses of the built test suite, which cover its size, the distribution of knowledge types and the reasoning accuracy of pretrained language models on the target translations of this test suite.

General Statistics
Statistics on the built test suite are displayed in Table 1. We show the number of triples, the number of unique tokens, and the average number of tokens per sentence in each test set. Although sentences in the test suite are not very long, they are challenging to translate correctly because commonsense reasoning is involved, as our experiments will verify.

Evaluation of Pretrained Language Models on the Test Suite
In our test suite, we find that for 93.7% of the instances (1,124 of 1,200 test instances), the correctness of the target translations can be determined from the translations themselves (i.e., by performing commonsense reasoning), without reference to the corresponding source sentences. This is exactly what we want, as the purpose of this test suite is to evaluate commonsense reasoning rather than the ability of NMT to explore source context for translation disambiguation that is unrelated to common sense. This is also consistent with the first rule for building the test suite: evaluating commonsense reasoning from the target side. Since the reasonability of these translations can be determined from the translations alone, we want to know how challenging they are for pretrained language models in terms of commonsense reasoning. Hence, we evaluate state-of-the-art language models pretrained on large-scale data, including BERT (Devlin et al., 2019), GPT (Radford, 2018), and GPT-2 (Radford et al., 2019), on these 1,124 translation pairs (pairs of reference and contrastive translations). For notational convenience, we still use "the test suite" to refer to these instances, as only 76 cases are excluded from this evaluation. Following  and Zhou et al. (2020), for each pair $(e^r, e^c)$, we use a pretrained language model to compute the language model scores of the two translations. The translation with the higher score is labelled as the correct one by the language model. By comparing these labels with the ground-truth labels, we obtain the commonsense reasoning accuracy of the corresponding language model on these instances.
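The pairwise evaluation described above can be sketched as follows. This is a minimal illustration, not the evaluation code used for the reported results: `score` stands for any sentence-level language model scoring function, and the lookup table `TOY_SCORES` with its numbers is made up purely so that the example runs without a pretrained model.

```python
from typing import Callable, Iterable, Tuple


def pairwise_accuracy(
    pairs: Iterable[Tuple[str, str]],
    score: Callable[[str], float],
) -> float:
    """Fraction of (reference, contrastive) pairs where the reference scores higher."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one pair is required")
    correct = sum(1 for reference, contrastive in pairs
                  if score(reference) > score(contrastive))
    return correct / len(pairs)


# Made-up scores standing in for a real language model (assumption).
TOY_SCORES = {
    "He put a turkey into the fridge": -21.0,
    "He put an elephant into the fridge": -24.5,
    "The main force has already launched an attack on the enemy's building.": -40.2,
    "The main force has already launched a research on the enemy's building.": -39.8,
}

toy_pairs = [
    ("He put a turkey into the fridge",
     "He put an elephant into the fridge"),
    ("The main force has already launched an attack on the enemy's building.",
     "The main force has already launched a research on the enemy's building."),
]

toy_accuracy = pairwise_accuracy(toy_pairs, TOY_SCORES.__getitem__)
```

With these toy numbers the scorer prefers the reference in the first pair only, so `toy_accuracy` is 0.5. A strict `>` means a tie counts as a wrong prediction, which is a design choice of this sketch.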
Results are shown in Table 3. All language models are better than random guess, which confirms that they have some commonsense reasoning ability. They perform worse on the contextual SA test set than on the other two test sets, demonstrating the difficulty of cross-sentence commonsense reasoning. BERT-large achieves the highest accuracy, 0.712. The number of parameters of BERT-large is equal to that of GPT-2 medium, almost 3 times as large as that of GPT-2 base and BERT-base (340M vs. 117M). We conjecture that the superiority of BERT models over GPT/GPT-2 models is due to the bidirectional context in BERT, which resonates with the findings of Zhou et al. (2020). The accuracies of all pretrained language models are lower than 72%. This suggests that our test suite is very challenging in commonsense reasoning, even for language models trained on a large amount of data.

Experiments
In this section, we conducted extensive experiments to evaluate the commonsense reasoning capability of state-of-the-art neural machine translation on the built test suite.

Experimental setup
We adopted the CWMT Chinese-English corpus 3 of news domain as training data for NMT systems. This corpus contains 9M parallel sentences. We used byte pair encoding compression algorithm (BPE) (Sennrich et al., 2016) to process all these data and restricted merge operations to a maximum of 30k.
We trained two neural machine translation models on the training data: RNNSearch (Bahdanau et al., 2015) and Transformer (Vaswani et al., 2017). We used the Transformer base model with 6 layers and 8 self-attention heads per layer. For RNNSearch, we employed an architecture with 4 LSTM layers and 512-dimensional hidden states. We used Adam (Kingma and Ba, 2015) to train both NMT models. β1 and β2 of Adam were set to 0.9 and 0.999, the learning rate was set to 0.0005, and the gradient norm was set to 5. To take full advantage of GPUs, we batched sentences of similar lengths. We trained both models on a single machine with eight 1080Ti cards. Each mini-batch contained 32,000 tokens. During decoding, we employed the beam search algorithm and set the beam size to 5.
3 Available at: http://nlp.nju.edu.cn/cwmt-wmt
To evaluate the commonsense reasoning accuracy of NMT on the test suite, we applied NMT models to score each pair $(s, t)$ as follows:

$$\mathrm{Score}(t \mid s) = \frac{1}{|t|} \sum_{i=1}^{|t|} \log p(t_i \mid t_{<i}, s)$$

where $p(t_i \mid t_{<i}, s)$ is the probability of the target word $t_i$ given the target history and the source sentence. Given a triple $(z, e^r, e^c)$, if an NMT model scores the reference translation higher than the contrastive translation (i.e., $\mathrm{Score}(e^r \mid z) > \mathrm{Score}(e^c \mid z)$), the NMT model is considered to have made a correct commonsense reasoning prediction. This is reasonable, as $e^r$ and $e^c$ differ only in the words or structures related to the lexical or syntactic commonsense ambiguity point, as described in Section 3.1. By scoring each triple with an NMT model, we can measure the commonsense reasoning accuracy of the model on our test suite.
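The scoring and comparison step can be sketched in a few lines. The sketch assumes that the per-token log-probabilities $\log p(t_i \mid t_{<i}, s)$ have already been obtained from an NMT model by forced decoding, and that the sentence score is their length-normalized sum; the normalization and all function names are assumptions of this illustration.

```python
from typing import Iterable, Sequence, Tuple


def sequence_score(token_logprobs: Sequence[float]) -> float:
    """Length-normalized sum of per-token log-probabilities (assumed form)."""
    if not token_logprobs:
        raise ValueError("a translation needs at least one token")
    return sum(token_logprobs) / len(token_logprobs)


def prefers_reference(ref_logprobs: Sequence[float],
                      con_logprobs: Sequence[float]) -> bool:
    """True if the reference translation scores strictly higher."""
    return sequence_score(ref_logprobs) > sequence_score(con_logprobs)


def reasoning_accuracy(
    scored_triples: Iterable[Tuple[Sequence[float], Sequence[float]]],
) -> float:
    """Share of triples in which the model prefers the reference."""
    scored = list(scored_triples)
    hits = sum(1 for ref, con in scored if prefers_reference(ref, con))
    return hits / len(scored)


# Made-up log-probabilities for two triples (illustration only).
triples = [
    ([-0.5, -1.0, -1.5], [-0.5, -1.0, -4.5]),  # reference preferred
    ([-2.0, -2.0], [-1.0, -1.0]),              # contrastive preferred
]
```

Because the two translations of a triple differ only around the ambiguity point, most token log-probabilities are similar and the comparison is driven by the ambiguous words or structures.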

Results
BLEU scores for the two NMT models are given in Table 4. Commonsense reasoning results on the test suite are provided in Table 5. From these results, we can observe that:
• Both BLEU and commonsense reasoning accuracy clearly show that Transformer is better than RNNSearch.
• Both RNNSearch and Transformer perform better on the contextless SA test set than on the contextual SA test set according to the commonsense reasoning accuracy. This is consistent with the results of pretrained language models shown in Table 3.
• The gap between Transformer and RNNSearch on the CL-SA test set is larger than that on the other two test sets. The reason might be that the self-attention mechanism allows Transformer to more easily detect collocations (e.g., "leg" and "table" in Figure 3) for disambiguation on the CL-SA test set. According to our observation, many CL-SA cases can be disambiguated by collocations.
• Compared with the relative BLEU improvement of Transformer over RNNSearch, the relative improvement in commonsense reasoning accuracy is smaller (8.2% vs. 18.91% in BLEU), indicating that more effort is needed not only to improve translation quality in terms of BLEU but also to enhance the commonsense reasoning ability of NMT.

Effect of the Size of Training Data
We conducted experiments to investigate the impact of the amount of training data on the commonsense reasoning performance of the state-of-the-art NMT model Transformer. Results are displayed in Figure 5. Generally, as the amount of training data increases, the commonsense reasoning ability of NMT systems rises too. Although we used all of the CWMT Chinese-English training data to train NMT, we did not observe the commonsense reasoning accuracy levelling off. We conjecture that the growth has the potential to continue. We leave using more data to measure the growth momentum of NMT commonsense reasoning to future work. Another finding from Figure 5 is that the commonsense reasoning performance on the contextless SA test set is always higher than on the other two test sets. As discussed in the previous subsection, this may be due to the shorter sentences and the collocations in this test set.

Effect of Sentence Length
We carried out an analysis of the impact of source sentence length on commonsense reasoning. We divided the test suite into 5 groups according to the length of source sentences. The results are shown in Figure 6. Generally, Transformer is better than RNNSearch in almost all length groups. As the length of source sentences increases, the commonsense reasoning performance tends to go down. This may suggest that long-distance or cross-sentence commonsense reasoning is more challenging for NMT than short-distance reasoning, which is consistent with our finding on the CL-SA test set.
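A length-based breakdown of this kind can be sketched as below. The group boundaries are hypothetical, since the actual boundaries of the 5 groups are not stated here, and the input is assumed to be one (source length, prediction correct) pair per test instance.

```python
from bisect import bisect_left
from collections import defaultdict
from typing import Dict, Iterable, Sequence, Tuple


def accuracy_by_length(
    examples: Iterable[Tuple[int, bool]],
    upper_bounds: Sequence[int],
) -> Dict[int, float]:
    """Accuracy per length group.

    upper_bounds holds inclusive, increasing upper limits; a length above the
    last limit falls into one extra final group.
    """
    totals: Dict[int, int] = defaultdict(int)
    hits: Dict[int, int] = defaultdict(int)
    for length, correct in examples:
        group = bisect_left(upper_bounds, length)
        totals[group] += 1
        hits[group] += int(correct)
    return {group: hits[group] / totals[group] for group in sorted(totals)}


# Hypothetical boundaries giving 5 groups: <=5, 6-10, 11-15, 16-20, >20.
BOUNDS = [5, 10, 15, 20]
toy_examples = [(4, True), (5, False), (8, True), (25, False)]
by_length = accuracy_by_length(toy_examples, BOUNDS)
```

Groups that contain no instance are simply absent from the returned dictionary, which avoids dividing by zero.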

Effect of Commonsense Knowledge Types
Finally, we analyzed the commonsense reasoning capability of Transformer on different commonsense knowledge types. Studying different types of common sense can help us understand what kind of commonsense knowledge is more needed to solve commonsense reasoning problems in NMT. The results are shown in Figure 7. Transformer-based NMT obtains relatively good results on commonsense reasoning on properties, structures, actions, but performs badly on reasoning on behaviors and emotions.

Further Analysis

Analysis on Reasoning Consistency
Our test suite contains 600 example blocks, each of which focuses on only one LA/SA ambiguity point. For the two reasonable interpretations $(z_1, z_2)$ of a given ambiguity point, NMT models need to make two translation predictions: one for $(e^r_1, e^c_1)$ and the other for $(e^r_2, e^c_2)$. If they choose $e^r_1$ and $e^r_2$ (both right reasoning predictions) or $e^c_1$ and $e^c_2$ (both wrong reasoning predictions), we treat this as consistent reasoning, and otherwise as inconsistent reasoning. Partially inspired by Zhou et al. (2020), we conducted an analysis of reasoning consistency.
We counted the times that a tested NMT model made consistent reasoning predictions and calculated the consistency rate on the three test sets. Results are shown in Table 6. Disappointingly, the reasoning consistency rates for both RNNSearch and Transformer are lower than random guess (0.5).
On the contextless SA test set where both NMT models have higher reasoning accuracies, the rates of reasoning consistency are also higher than those of the other two test sets.
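The consistency rate defined above can be computed directly from per-block outcomes. In this minimal sketch, each block is represented by a pair of booleans stating whether the prediction for its first and second triple was correct; the function names and the toy outcomes are illustrative assumptions.

```python
from typing import Iterable, Tuple

Outcome = Tuple[bool, bool]  # (first triple correct, second triple correct)


def consistency_rate(blocks: Iterable[Outcome]) -> float:
    """Share of blocks whose two predictions are both right or both wrong."""
    blocks = list(blocks)
    consistent = sum(1 for first, second in blocks if first == second)
    return consistent / len(blocks)


def triple_accuracy(blocks: Iterable[Outcome]) -> float:
    """Share of correct predictions over all triples of all blocks."""
    blocks = list(blocks)
    correct = sum(int(first) + int(second) for first, second in blocks)
    return correct / (2 * len(blocks))


# Toy outcomes: three blocks with one right and one wrong prediction,
# and one block with both predictions right.
toy_blocks = [(True, False), (True, False), (False, True), (True, True)]
toy_consistency = consistency_rate(toy_blocks)
toy_accuracy = triple_accuracy(toy_blocks)
```

On these toy outcomes the accuracy is 0.625 while the consistency rate is only 0.25, which shows how a model that tends to pick the same reading for both sentences of a block can score above 50% in accuracy and still be mostly inconsistent.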

Analysis on Translation Errors
We have already automatically evaluated commonsense reasoning in NMT with both reasoning accuracy and reasoning consistency rate. We further manually analyzed the translation errors of Transformer on the entire test suite. We roughly grouped translation errors into three categories: commonsense errors (translations that are not consistent with common sense), ordinary meaning errors (wrong translations of source words that are not commonsense ambiguity points) and other errors (e.g., missing words). These errors were manually detected and labeled by two annotators, who checked all examples in the test suite. The inter-annotator agreement, measured as the ratio of the number of labels on which the two annotators agree to the total number of labels from the two annotators, is 92%.
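The agreement measure described above amounts to a simple matching ratio. The sketch below assumes that both annotators label the same examples in the same order; the label strings are hypothetical.

```python
from typing import Sequence


def agreement_rate(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Share of examples on which two annotators assign the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same examples")
    if not labels_a:
        raise ValueError("at least one label is required")
    matches = sum(1 for a, b in zip(labels_a, labels_b) if a == b)
    return matches / len(labels_a)


# Hypothetical labels for four translation errors.
annotator_1 = ["commonsense", "commonsense", "ordinary", "other"]
annotator_2 = ["commonsense", "ordinary", "ordinary", "other"]
toy_agreement = agreement_rate(annotator_1, annotator_2)
```

This is raw agreement without any correction for chance, in line with the ratio described in the text.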
Results are reported in Table 7. The majority of translation errors are indeed related to common sense (71.6%). This suggests that our test suite is a suitable and challenging testbed for evaluating commonsense reasoning in NMT.

Conclusion
In this paper, we have presented a test suite, including a lexical ambiguity test set and two syntactic ambiguity test sets, to evaluate the commonsense reasoning capability of state-of-the-art neural machine translation models. We describe the rules for building this test suite and conduct statistical analyses on it. Our evaluation experiments and analyses on this test suite suggest that commonsense reasoning in modern machine translation models is still in its infancy and that more effort is needed to advance NMT in this direction.