Data and Representation for Turkish Natural Language Inference

Large annotated datasets in NLP are overwhelmingly in English. This is an obstacle to progress in other languages. Unfortunately, obtaining new annotated resources for each task in each language would be prohibitively expensive. At the same time, commercial machine translation systems are now robust. Can we leverage these systems to translate English-language datasets automatically? In this paper, we offer a positive response for natural language inference (NLI) in Turkish. We translated two large English NLI datasets into Turkish and had a team of experts validate their translation quality and fidelity to the original labels. Using these datasets, we address core issues of representation for Turkish NLI. We find that in-language embeddings are essential and that morphological parsing can be avoided where the training set is large. Finally, we show that models trained on our machine-translated datasets are successful on human-translated evaluation sets. We share all code, models, and data publicly.

Unfortunately, outside of parsing and MT, these datasets tend to be in English. This is not only an obstacle to progress on other languages, but it also limits the field of NLP itself: English is generally not a representative example of the world's languages when it comes to morphology, syntax, or spelling conventions and other kinds of standardization (Munro, 2012), so it's risky to assume that models and results for English will generalize to other languages.
A natural response to these gaps in our dataset coverage might be to launch new annotation efforts for multiple languages. However, this would likely be prohibitively expensive. For example, based on the costs of SNLI (Bowman et al., 2015) and MultiNLI (Williams et al., 2018a), we estimate that each large dataset for NLI would cost upwards of US $50,000 if created completely from scratch.
At the same time, commercial MT systems have improved dramatically in recent years (Wu et al., 2016; Johnson et al., 2017; Hieber et al., 2017, 2018; Tomasello, 2019; Hieber et al., 2020). They now offer high-quality translations between hundreds of language pairs. This raises the question: can we use these MT systems to translate English-language datasets and use the translated versions to drive more genuinely multilingual development in NLP? In this paper, we offer evidence that the answer is "yes".
Using Amazon Translate, we translated SNLI and MultiNLI from English into Turkish to create the first large Turkish NLI data sets, NLI-TR, at a tiny fraction of the cost of creating them from scratch. Turkish is an interesting challenge in this context since it is very different from English, most notably in its very free word order and complex morphology. A word in Turkish bears morpho-syntactic properties in the sense that phrases formed of several words in languages like English can be expressed with a single word form.
In our validation phase (Section 3), a team of Turkish-English bilingual speakers assessed the quality of a large sample of the translations in NLI-TR. They found the quality to be very high, which suggests that translated datasets can provide a foundation for NLI research on a resource-constrained language, even if it has significantly different characteristics from English.
We then use these datasets to study the roles of pre-trained language models and morphological parsing in successful NLI systems for Turkish (Section 4). For these experiments, we fit classifiers on top of pre-trained BERT parameters (Devlin et al., 2019) and compare the original BERT-base release, the multilingual BERT embeddings released by the BERT team, and the Turkish BERT (BERTurk) embeddings of Schweter (2020). We find BERTurk to be superior to the others for NLI-TR.
Morphological parsing is a natural preprocessing step for Turkish due to its complex morphology. Thus, we assess the use of three morphological parsers as the second case study: Zemberek (Akın and Akın, 2007), BOUN parser (Sak et al., 2011), and Turkish Morphology (Öztürel et al., 2019). We find that the parsers help where training data is sparse, but the need for a parser disappears as the training data increases. This is a striking finding: one might expect that Turkish would require morphological parsing given its complex word-formation processes. It might be regarded as welcome news, though, since the parsers are expensive to run. In Section 4.2, we report on some new optimizations of existing tools to make the relevant parsing jobs feasible, but we would still like to avoid these steps if possible, and it seems that we can for NLI.
Finally, we investigate how models trained on the machine-translated datasets perform on the human translations from XNLI (Conneau et al., 2018). We find that machine-translated and human-translated sentences yield similar results, suggesting that it is safe to apply models trained on machine-translated datasets to human-written sentences.

Related Work
Early in the development of textual entailment tasks,  argued for multilingual versions of them. This led to subsequent explorations of a variety of techniques, including crowdsourcing translations, relying on parallel corpora to support reasoning across languages, and automatically translating datasets using MT systems (Real et al., 2018; Rodrigues et al., 2020). This research informed SemEval tasks in 2012 (Negri et al., 2012) and 2013 (Negri et al., 2013), followed by the ASSIN 1 (Fonseca et al., 2016) and ASSIN 2 (Real et al., 2020) shared tasks exploring the viability of multilingual NLI.
From the perspective of present-day NLI models, these datasets are very small, but they could be used productively as challenge problems.
More recently, Conneau et al. (2018) reinvigorated work on multilingual NLI with their XNLI dataset. XNLI provides expert-translated evaluation sets from English into 14 other languages, including Turkish. Though they are valuable resources to push NLI research beyond English, test sets alone are insufficient for in-language training on target languages, which is likely to lower the performance of the resulting systems.
Although it was not the main focus of the XNLI effort, Conneau et al. (2018) distributed machine translations of MultiNLI into other languages, including Turkish, which we call MultiNLI-TR XNLI in this paper. The translations helped them form a strong baseline for their cross-lingual models, which proved superior in their assessments. However, the quality of the translations is crucial, as the authors note. Our hope for NLI-TR is that it supports effective in-language training.
XNLI's primary focus on test sets rather than training is justified by a wide body of recent results on cross-lingual transfer learning. Multilingual embeddings (embeddings trained on multilingual corpora) have played an important role in these developments. The BERT team (Devlin et al., 2019) released multilingual embeddings and demonstrated their value using XNLI. At the same time, BERT models have been released for a variety of individual languages (see Wolf et al., 2019) and specialized domains (Alsentzer et al., 2019; Lee et al., 2020). While we might expect the language- and domain-specific embeddings to be superior for the kind of data they were trained on, the multilingual versions might be more efficient in large-scale deployments in diverse environments. Balancing these trade-offs is challenging. Here, we offer some insight into these trade-offs for Turkish.
Turkish is a morphologically-rich language in which new word forms are freely created using suffixation. Several morphological parsers (Akın and Akın, 2007; Öztürel et al., 2019; Sak et al., 2009) and morphological disambiguation systems (Akın and Akın, 2007; Sak et al., 2011) have been developed for Turkish. The state-of-the-art morphological analyzers can parse with success rates around 95%. We use three of these parsers in this work to evaluate the role of morphology in NLI systems (Section 4.2).
3 Creating and Validating NLI-TR

English NLI Datasets
We translated the Stanford Natural Language Inference Corpus (SNLI; Bowman et al., 2015) and the Multi-Genre Natural Language Inference Corpus (MultiNLI; Williams et al., 2018b) to create labeled NLI datasets for Turkish, NLI-TR.
SNLI contains ≈570K semantically related English sentence pairs. The semantic relations are entailment, contradiction, and neutral. The premise sentences for SNLI are image captions from the Flickr30K corpus (Young et al., 2014), and the hypothesis sentences were written by crowdworkers. SNLI texts are mostly short and structurally simple. We translated SNLI while respecting the train, development (dev), and test splits.
MultiNLI comprises ≈433K sentence pairs in English, and the pairs have the same semantic relations as SNLI. However, MultiNLI spans a broader range of genres, including travel guides, fiction, dialogue, and journalism. As a result, the texts are generally more complex than SNLI. In addition, MultiNLI contains matched and mismatched dev and test sets, where the sentences in the former set are from the same sources as the training set, whereas the latter consists of texts from different genres than those found in the training set. We translated the training set and both dev sets for NLI-TR.

Automatic Translation Effort
As we noted in Section 1, Turkish is a resource-constrained language with few labeled data sets compared to English. Furthermore, Turkish has a fundamentally different grammar from English that could hinder transfer-learning approaches. These facts motivate our effort to translate SNLI and MultiNLI from English to Turkish. We employ an automatic MT system and hope that it will deliver high-quality translations that we can use for NLI research and system development in Turkish.
We used Amazon Translate, a commercial neural machine translation service. Translation of all folds of SNLI and MultiNLI cost just US $2K (vs. the ≈US $100K we would expect for replicating these two datasets from scratch) and took five days with no parallelization. We refer to the translated datasets as SNLI-TR and MultiNLI-TR, and collectively as NLI-TR. Translation examples are provided in Table 1. We publicly share NLI-TR.

SNLI-TR and MultiNLI-TR are different from SNLI and MultiNLI in terms of token counts and vocabulary sizes. Table 2 illustrates these features before and after translation. For each fold in each dataset, translation decreased the number of tokens in the corpus, but it increased the vocabulary sizes drastically, in both the cased and uncased versions. Both of these differences are expected: many multiword expressions in English are translated into individual words due to the agglutinating nature of Turkish. For instance, the four-word English expression "when in your home" can be translated to the single word "evinizdeyken". Table 2 also reflects the complexity difference between SNLI and MultiNLI that we noted in Section 3.1. Though SNLI contains more sentence pairs than MultiNLI, it has fewer tokens and a smaller vocabulary.
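The token and vocabulary counts discussed here can be illustrated with a small sketch. It assumes plain whitespace tokenization, which may differ from the tokenization behind Table 2, and the function name `corpus_stats` is hypothetical. The word forms built from "ev" ('home') are a hand-constructed toy sample, not data from NLI-TR.

```python
from itertools import product


def corpus_stats(sentences, lowercase=False):
    """Return (token count, vocabulary size) under whitespace tokenization."""
    vocab = set()
    n_tokens = 0
    for sentence in sentences:
        text = sentence.lower() if lowercase else sentence
        for token in text.split():
            vocab.add(token)
            n_tokens += 1
    return n_tokens, len(vocab)


# Toy data (assumption, not from NLI-TR): nine English phrases and their
# single-word Turkish counterparts built as "ev" + possessive + case suffix.
prepositions = {"in": "de", "from": "den", "to": "e"}
possessors = {"my": "im", "your": "iniz", "our": "imiz"}

english = [f"{prep} {poss} home" for prep, poss in product(prepositions, possessors)]
turkish = ["ev" + possessors[poss] + prepositions[prep]
           for prep, poss in product(prepositions, possessors)]

print(corpus_stats(english))  # 27 tokens over 7 distinct words
print(corpus_stats(turkish))  # 9 tokens over 9 distinct words
```

On this toy sample the Turkish side has a third of the tokens but a larger vocabulary, which is the same direction of change that the text describes for the translated corpora.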

Translation Quality Assurance
Two major risks arise when using MT systems to translate NLI datasets. First, the translation quality might be low. Second, even if the individual sentences are translated correctly, the nature of the mapping from the source to the target language might affect the semantic relations between sentences. For example, English has the words "boy" and "girl" to refer to male and female children, and both those words can be translated to a gender-neutral Turkish word "çocuk". Now, consider a premise sentence "A boy is running" and its contradiction pair "A girl is running". Both sentences can be translated fluently into the same Turkish sentence, "Çocuk koşuyor", which changes the semantic relation from contradiction to entailment.
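The extreme case in this example, where premise and hypothesis collapse into the same Turkish sentence, can be detected automatically. The sketch below is only a heuristic illustration and is not part of the validation procedure described in this paper, which relies on expert judgements; the function names and the tuple layout are assumptions.

```python
import string


def normalize(sentence):
    """Casefold and drop ASCII punctuation and extra whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(sentence.casefold().translate(table).split())


def flag_collapsed_pairs(examples):
    """Return indices of non-entailment pairs whose two translations coincide.

    `examples` is a list of (premise_tr, hypothesis_tr, original_label) tuples.
    """
    return [
        i
        for i, (premise, hypothesis, label) in enumerate(examples)
        if label != "entailment" and normalize(premise) == normalize(hypothesis)
    ]


examples = [
    # "A boy is running" / "A girl is running" after translation
    ("Çocuk koşuyor.", "Çocuk koşuyor", "contradiction"),
    # An unrelated pair that should not be flagged
    ("Bir ev yıkılıyor.", "Çocuk koşuyor.", "contradiction"),
]
print(flag_collapsed_pairs(examples))
```

A check like this catches only identical translations; subtler shifts in meaning still require human inspection.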
Thus, to determine the viability of NLI-TR as a tool for NLI research, we must assess both translation quality and the consistency of the NLI labels. To do this, we assembled a team of ten Turkish-English bilingual speakers who were familiar with the NLI task and were either MSc. candidates or graduates in a relevant field.
For expert evaluation, we grouped the translations into example sets of four sentences as in Table 1, where the first sentence (premise) is semantically related to the rest (hypotheses). We distributed the sets to the experts so that each set (and sentence) was examined by five randomly chosen experts and each expert co-examined approximately the same number of sets with each other expert. Each expert evaluated the translation by (i) grading the translation quality between 1 and 5 (inclusive; 5 the best) and (ii) checking if the translation altered the semantic relation. We distributed an annotation guide (https://github.com/boun-tabi/NLI-TR) to the team to standardize the criteria. In total, 500 example sets (2,000 translated sentences) were each examined by five experts, yielding 10,000 annotations.

Premise
English: Three men are sitting near an orange building with blue trim.
Turkish: Üç adam mavi süslemeli turuncu bir binanın yanında oturuyor.

Neutral
English: Three males are seated near an orange house with blue trim and a blue roof.
Turkish: Üç erkek mavi süslü ve mavi çatılı turuncu bir evin yakınında oturuyor.

We use the average translation score of the annotations to estimate translation quality. For label consistency, there are two comparisons we can make, since we have five new annotations per example. The annotation-level analysis compares each new annotation with the gold label on the original English example. The majority-level analysis compares only the majority label (if any) of the five new annotations with the English gold label. The annotation-level analysis is more stringent, whereas the majority-level analysis directly connects with how we expect NLI-TR to be most commonly used. Table 3 reports these analyses for SNLI and MultiNLI. The results are extremely reassuring. First, average translation quality is near 5 (ceiling) for all the splits. Second, annotation-level label consistency is over 90% and majority-level label consistency is over 95%, indicating that the linguistic differences between English and Turkish are not a major issue for preserving NLI labels.
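The annotation-level and majority-level comparisons can be sketched as follows. Two choices here are assumptions rather than details stated in the text: a "majority" means a strict majority of the annotations for an example, and examples with no majority label count as inconsistent. The function names are hypothetical.

```python
from collections import Counter


def majority_label(labels):
    """Return the label chosen by a strict majority, or None if there is none."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None


def label_consistency(gold, annotations):
    """Return (annotation-level, majority-level) consistency rates.

    `gold[i]` is the original English label of example i and
    `annotations[i]` is the list of labels given by its annotators.
    """
    total = sum(len(labels) for labels in annotations)
    matching = sum(
        label == g for g, labels in zip(gold, annotations) for label in labels
    )
    # Assumption: an example without a majority label counts as inconsistent.
    majority_hits = sum(
        majority_label(labels) == g for g, labels in zip(gold, annotations)
    )
    return matching / total, majority_hits / len(gold)


gold = ["entailment", "neutral"]
annotations = [
    ["entailment", "entailment", "entailment", "entailment", "neutral"],
    ["neutral", "neutral", "contradiction", "entailment", "entailment"],
]
print(label_consistency(gold, annotations))  # (0.6, 0.5)
```

The second toy example has no strict majority, so it lowers the majority-level rate under the assumption above; excluding such examples instead would be an equally defensible convention.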
To assess the reliability of the translation quality scores, we calculated the Intra-Class Correlation (ICC; McGraw and Wong, 1996). ICC is frequently adopted in medical studies to assess ordinal annotations provided by experts randomly drawn from a team. Its assumptions align well with our evaluation scheme. We obtained an ICC of 0.8426, which suggests excellent agreement (Cicchetti, 1994; Hallgren, 2012).
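For readers unfamiliar with ICC, the following is a minimal sketch of the one-way random-effects form, computed from the between-item and within-item mean squares of a one-way ANOVA. The text does not state which ICC variant was used, so the choice of the one-way form (motivated by raters being drawn at random from a team) is an assumption, as are the function name and the toy ratings.

```python
def icc_one_way(ratings):
    """One-way random-effects ICC for an items-by-raters table.

    `ratings[i]` holds the k scores given to item i. Returns a pair:
    (reliability of a single rating, reliability of the k-rating average).
    """
    n = len(ratings)
    k = len(ratings[0])
    if any(len(row) != k for row in ratings):
        raise ValueError("every item needs the same number of ratings")

    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]

    # Mean squares between items and within items.
    ms_between = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    ) / (n * (k - 1))

    single = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
    average = (ms_between - ms_within) / ms_between
    return single, average


# Toy ratings (assumption): two items, two raters each.
single, average = icc_one_way([[1, 2], [3, 4]])
print(round(single, 4), round(average, 4))  # 0.7778 0.875
```

The function divides by the between-item mean square, so it is undefined when every item has the same mean score.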
Table 3: Translation quality and label consistency of the translations in SNLI-TR and MultiNLI-TR based on expert judgements. For the quality ratings (1-5), we report mean and standard deviation (in parentheses). For label consistency, we report the percentage of labels in SNLI-TR and MultiNLI-TR judged consistent with the original label, at both the annotation level and the sentence level.

We also computed Krippendorff's alpha (Krippendorff, 1970), which is an inter-annotator agreement metric used more commonly in NLP. This metric is suitable for both nominal and ordinal annotations involving multiple annotators. We calculated intercoder reliability of the ordinally-scaled translation quality score as 0.47. Our annotation-level label consistency yielded a score of 0.78 whereas our majority-level label consistency resulted in a score of 0.99. In contrast to the near-perfect agreement in the majority-level label consistency, the Krippendorff's alpha values of annotation-level labels and translation quality scores suggest less overall agreement than our ICC values do, but they are still acceptable, and ICC is arguably the more appropriate metric for our study. Krippendorff's alpha is generally used for large, diverse annotation teams, and its penalties for disagreements are known to be harsh.
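Krippendorff's alpha compares observed disagreement with the disagreement expected by chance. The sketch below covers only the nominal case (relevant to the label judgements); the ordinal variant used for the quality scores needs a different distance function and is not shown. The function name and the toy annotations are assumptions.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units[i]` is the list of labels assigned to item i; items with fewer
    than two labels are skipped because they contain no pairable values.
    """
    # Coincidence matrix: each ordered pair of labels within an item
    # contributes 1 / (m - 1), where m is the number of labels on that item.
    coincidences = Counter()
    for values in units:
        m = len(values)
        if m < 2:
            continue
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b) / n
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n * (n - 1))
    # Undefined (division by zero) when only one label value ever occurs.
    return 1.0 - observed / expected


print(krippendorff_alpha_nominal([["e", "e"], ["n", "n"]]))  # full agreement
print(krippendorff_alpha_nominal([["e", "e"], ["e", "n"]]))  # chance level
```

In the second toy case the observed and expected disagreement are equal, which is why alpha drops to zero even though three of the four labels agree; this illustrates how harshly the metric treats disagreement on small samples.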
Overall, it seems that the very high estimates of translation quality and label consistency of NLI-TR are trustworthy, and only a small percentage of premise-hypothesis pairs have inconsistent semantic labels between their original and translated forms. Still, we would like to better understand why inconsistencies do arise. To this end, we inspected all 49 label-inconsistent pairs in our annotations. We find that low translation quality is the leading source of such errors, which further emphasizes how essential it is to work with high-quality translations.
Of the label-inconsistent pairs with good translations, we find that about 20 probably trace to differing perspectives on how to apply the NLI annotation guidelines. Relatedly, Conneau et al. (2018) find that NLI labels often cannot be completely recovered by different annotators even with no sentence modifications.
Finally, we did find one example of label inconsistency that traces to a subtle difference between the English and Turkish lexicons. In this example, the premise "Your speeches are inflammatory" was translated to Turkish as "Konuşmalarınız çok kışkırtıcı", which can be back-translated as "Your speeches are provocative", while its entailment hypothesis "Your speeches upset people" was translated as "Konuşmaların insanları üzüyor", equivalent to "Your speeches make people sad". Annotators agreed that both of these translations are of maximum quality, but also stated that the Turkish pair should be labeled neutral. As bilingual speakers, we feel that this is essentially correct; the relevant English and Turkish adjectives are subtly different in ways that affect the NLI label. However, such examples seem to be rare and so pose minimal risk for conducting research using NLI-TR.

Case Study I: Comparing BERT models on Turkish NLI Datasets
The arrival of pre-trained model-sharing hubs (e.g., TensorFlow Hub, PyTorch Hub, and Hugging Face Hub) has democratized access to Transformer-based models (Vaswani et al., 2017), which are mostly in English. Combined with the abundance of labeled English datasets for fine-tuning, this has increased the performance gap between English and resource-constrained languages.
Here, we use NLI-TR to analyze the effects of pretraining Transformer-based models. We compare three BERT models trained on different corpora by fine-tuning them on NLI-TR. The results quantify the importance of having high-quality, language-specific resources.

Experimental Settings
We compared cased BERT-English (BERT-En), BERT-Multi, and BERTurk (Schweter, 2020). BERT-En is the original BERT-base model released by Devlin et al. (2019), which used an English-only corpus for training. BERT-Multi was released by the BERT team as well, and was trained on a corpus containing texts from 104 languages, including Turkish. Schweter's BERTurk also uses the same model architecture and is trained on a Turkish corpus (≈30GB).
We fine-tuned each model on train folds of NLI-TR separately and fixed the maximum sequence length to 128 for all experiments. Similarly, we used a common learning rate of 2 × 10^-5 and batch size of 8 with no gradient accumulation. We fine-tuned each model for 3 epochs using Hugging Face's Transformers library (Wolf et al., 2019). We evaluated the models on the test set of SNLI-TR and the matched and mismatched dev splits of MultiNLI-TR. Table 4 reports the accuracy of each model on the evaluation sets.

Table 4 demonstrates that NLI-TR can be used to train high-quality Turkish NLI models. We observe that every model performed better on the dev and test folds of SNLI-TR than the dev folds of MultiNLI-TR, which is an expected outcome given the greater complexity of MultiNLI compared to SNLI. The translation effort seems to have preserved this fundamental difference between the two datasets.
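The fine-tuning settings reported in this section can be collected in a small configuration object. This is only a sketch: the class and field names are hypothetical and do not correspond to any particular library's arguments, and the training-set size in the usage line is a placeholder rather than the size of an NLI-TR fold.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters stated in the text; the names are illustrative."""

    max_seq_length: int = 128
    learning_rate: float = 2e-5
    batch_size: int = 8
    gradient_accumulation_steps: int = 1  # i.e., no gradient accumulation
    num_epochs: int = 3

    def optimizer_steps(self, num_train_examples: int) -> int:
        """Number of parameter updates for a training set of the given size."""
        batches_per_epoch = math.ceil(num_train_examples / self.batch_size)
        updates_per_epoch = math.ceil(
            batches_per_epoch / self.gradient_accumulation_steps
        )
        return updates_per_epoch * self.num_epochs


config = FineTuneConfig()
print(config.optimizer_steps(1000))  # placeholder size, prints 375
```

Keeping the settings in one immutable object makes it easier to guarantee that all three BERT variants are fine-tuned under identical conditions.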

Results
In addition, BERTurk, which was trained on a Turkish corpus, achieved the highest accuracy, and BERT-Multi, which used a smaller Turkish corpus, ranked second, consistently on every evaluation fold. The ranking emphasizes the importance of having a Turkish corpus for pre-training.

Case Study II: Comparing Morphological Parsers on Turkish NLI Datasets
In this case study, we use NLI-TR to compare three morphological parsers with regular tokenization.
We train a BERT model from scratch utilizing each approach for pretraining and use NLI-TR for fine-tuning. This leads to the striking result that morphology adds information where training data is sparse, but its importance shrinks as the dataset grows larger.

Experimental Settings
Morphological Parsers. We use Zemberek (Akın and Akın, 2007), BOUN (Sak et al., 2011), and Turkish Morphology (Öztürel et al., 2019) as parsers and compare them with an approach that does not do morphological parsing.
Zemberek is a mainstream Turkish NLP library used in research (Büyük, 2020; Kuyumcu et al., 2019; Özer et al., 2018; Can, 2017; Dehkharghani et al., 2016; Gulcehre et al., 2015) and applications such as iOS 12.2 and Open Office. It has 67,755 entries in its lexicon and uses a rule-based parser. BOUN implements the Turkish morphology rules described by Oflazer (1994) with a Finite State Transducer, and its lexicon has 55,278 entries. Finally, Turkish Morphology is an OpenFST-based (Allauzen et al., 2007) morphological parser that was recently released by Google Research and uses a lexicon with 47,202 entries.
Out of the box, Zemberek and BOUN can parse 398K and 51K tokens per minute respectively, whereas Turkish Morphology can process only 1K tokens per minute. We sped up Turkish Morphology to parse 11 times more tokens per minute by implementing a dynamic programming wrapper (Bellman, 1952) that increased the cache hit ratio to 89.9%. This technique is already used by Zemberek.
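The idea of caching parser output can be sketched with plain memoization, since word forms repeat often in running text. The code below is an assumption about the general technique, not the wrapper actually used for Turkish Morphology; `toy_parse` is a hypothetical stand-in for a slow parser call and does no real morphological analysis.

```python
from functools import lru_cache


def make_cached_parser(parse_fn, maxsize=None):
    """Wrap `parse_fn` with a memoizing cache and expose its hit ratio."""
    cached = lru_cache(maxsize=maxsize)(parse_fn)

    def hit_ratio():
        info = cached.cache_info()
        lookups = info.hits + info.misses
        return info.hits / lookups if lookups else 0.0

    return cached, hit_ratio


def toy_parse(token):
    """Hypothetical placeholder for an expensive morphological analysis."""
    return token.lower()


parse, hit_ratio = make_cached_parser(toy_parse)

# Repeated word forms are served from the cache instead of being re-parsed.
tokens = "ev evde ev evde ev".split()
analyses = [parse(token) for token in tokens]
print(hit_ratio())  # 3 hits out of 5 lookups
```

An unbounded cache (`maxsize=None`) is the simplest choice for a sketch; a bounded cache would trade some hit ratio for a fixed memory footprint.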
Pretraining. To conduct a wide range of experiments on a limited budget, we opted to use one-tenth (≈4GB, 500M tokens) of the Turkish corpus used by BERTurk (Schweter, 2020) to pretrain BERT models. We analyzed each token morphologically using Zemberek, BOUN, and Turkish Morphology and trained a BERT model using the stems of the tokens only. For the model that does not utilize morphological information, we used tokens as they are. We used the BertWordPieceTokenizer class of Hugging Face Tokenizers with the same set of parameters for each model. We trained each model on a single Tesla V100 GPU of an NVIDIA DGX-1 system, allocating 128GB memory for 1 day. We split the dataset into 30 equal shards for parallel processing, where each shard comprises 1M sentences, and shuffled the shards prior to training to reduce the adverse effects of variance across the sentence styles in the different shards (Goodfellow et al., 2016). We used an effective batch size of 128 with gradient accumulation to address memory limitations.
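The sharding step can be sketched as follows. Several details are assumptions, since the text does not specify them: shards are contiguous slices of the corpus, leftover sentences that do not fill a shard are dropped, and "shuffling the shards" means randomizing the order in which whole shards are visited. The function names are hypothetical.

```python
import random


def make_shards(sentences, num_shards):
    """Split `sentences` into `num_shards` contiguous shards of equal size."""
    # Assumption: sentences that do not fit evenly are dropped.
    shard_size = len(sentences) // num_shards
    return [
        sentences[i * shard_size:(i + 1) * shard_size]
        for i in range(num_shards)
    ]


def shuffled_shard_order(shards, seed=0):
    """Return the shards in a reproducible random order."""
    order = list(range(len(shards)))
    random.Random(seed).shuffle(order)
    return [shards[i] for i in order]


# Toy corpus of ten "sentences" split into three shards of three.
corpus = [f"sentence {i}" for i in range(10)]
shards = make_shards(corpus, num_shards=3)
training_order = shuffled_shard_order(shards, seed=13)
print([len(shard) for shard in shards])
```

A fixed seed keeps the shard order reproducible across the four pretraining runs, so that differences between models are not confounded by data order.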
Fine-tuning. We fine-tuned each model on NLI-TR with the same setting as in Section 4.1, with the exception that we trained for only 1 epoch. We measured the accuracy on the evaluation sets with an interval of 1,000 training steps to observe the effect of morphological parsing as the dataset grew. Figure 1 reports the accuracy of all models with respect to fine-tuning steps on NLI-TR development sets, and Table 5 shows the final accuracies.

Results
Figure 1 suggests that morphological parsing is beneficial where the training set is small, but its importance largely disappears for large training sets. This is reflected also in the final results in Table 5. We relate this to the fact that BERT models create contextual embeddings of both word and subword tokens (Kudo, 2018; Kudo and Richardson, 2018; Sennrich et al., 2016). Given a sufficiently large dataset, BERT models can approximate the effects of morphological parsing even for Turkish, a morphologically-rich language. The trends are not uniform for SNLI-TR and MultiNLI-TR. For SNLI-TR, all three models display a similar learning curve, with a slight edge for Zemberek early on. For MultiNLI-TR, models with morphological parsers are more differentiated. However, all three converge to similar performance at the end of training on both datasets (Table 5).
In light of these findings, we suggest avoiding the use of morphological parsers for Turkish NLI where the training set is large, since the benefits of such parsers are generally not enough to offset the cost of running them.

Case Study III: Evaluating NLI-TR on Human-Translated Sentences
Thus far, we have used NLI-TR for both training and assessment. One might worry that machine-translated test sets are not reliable tools for measuring how models will perform on examples written by humans. In this section, we address this concern using the Turkish dev and test portions of XNLI, which were translated entirely by humans. The models we assess on XNLI are those from our first case study as well as models trained on a different machine-translated training dataset, MultiNLI-TR XNLI. Overall, we find that performance on XNLI is consistently very similar to performance on NLI-TR.

Datasets
MultiNLI-TR XNLI was created to investigate the performance of cross-lingual sentence embeddings compared to in-language ones (Conneau et al., 2018). It provides machine translations of only the MultiNLI training set, so we report comparisons with just the corresponding section of NLI-TR, and we train models only on these two training sets.

Models
We used the BERT models from Case Study I (Section 4.1) for evaluation. We fine-tuned each model on the training sets of MultiNLI-TR and MultiNLI-TR XNLI separately, following the same fine-tuning steps as in Section 4.1, and computed their accuracy on XNLI-Dev and XNLI-Test. Table 6 provides the results of the experiments. All three models consistently achieve higher accuracy on XNLI-Dev and XNLI-Test when fine-tuned with MultiNLI-TR, but the performance difference is modest. Table 6 also illustrates that BERTurk, backed by a Turkish-only training corpus, outperforms the other two models on all eight evaluations. Its performance is followed by BERT-Multi, which is trained on a corpus with texts in multiple languages, including Turkish. The same result was also shown in Case Study I using the evaluation splits of NLI-TR. Therefore, machine-translated MultiNLI-TR and human-translated XNLI display similar characteristics across evaluations, which lends further credence to our claim that MT can help provide a viable path to robust Turkish NLI.

Figure 1: Development set accuracy for the three morphological parsers and a model without morphological parsing. The x-axis tracks the size of the training set. We find that morphological parsing is generally helpful in early rounds, when the training set is very small, but that its importance diminishes as the training set increases. These effects are especially clear for the two MultiNLI-TR dev sets.

Results
Table 6: Accuracy results comparing NLI-TR with another machine-translated dataset. NLI-TR performed better, but the gap is modest, suggesting that both datasets have value for Turkish NLI. Figure 2 in our supplementary materials provides full learning curves. The results are very similar to those of Table 4 for MultiNLI, in overall quality and in the ranking of models.

To better understand how in-language pretraining (BERTurk) helps, we investigated the 57 hypotheses from XNLI-Dev and XNLI-Test where BERTurk was successful and BERT-Multi was not. For these sentences, we observed that the BERT-Multi tokenizer was often unable to segment the words into meaningful Turkish subword units, most likely due to its training on a multilingual corpus. For instance, BERT-Multi often could not segment the suffix "-me/ma", which negates a verb in Turkish, and thus bears crucial semantics for many contradiction examples (Gururangan et al., 2018). This shows that in-language training is essential not only for good vector representations but also for effective tokenization. We also hypothesize that subtle lexical distinctions are another factor in the performance difference between BERTurk and BERT-Multi. For example, though BERT-Multi successfully identified the semantic relations created by frequent pairs such as "hiç" ('any') and "hepsi" ('all'), it missed many other distinctions like these. We propose that this is due to the more limited vocabulary of BERT-Multi for Turkish and the more robust word representations in BERTurk.
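The tokenization issue can be illustrated with a sketch of greedy longest-match-first WordPiece segmentation, the scheme used by BERT tokenizers. Both vocabularies below are invented for illustration and are far smaller than any real BERT vocabulary; the example word "gelmedi" ('did not come') contains the negation suffix "-me".

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Segment `word` greedily, always taking the longest matching piece."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Invented toy vocabularies (assumption): one aligned with Turkish
# morphemes, one whose pieces cut across morpheme boundaries.
turkish_like_vocab = {"gel", "##me", "##di"}
multilingual_like_vocab = {"ge", "##lm", "##ed", "##i"}

print(wordpiece("gelmedi", turkish_like_vocab))
print(wordpiece("gelmedi", multilingual_like_vocab))
```

With the first vocabulary the negation suffix surfaces as its own piece ("##me"), whereas with the second it is split across "##lm" and "##ed", so the model never sees a token dedicated to negation.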
In addition to manual inspection, we computationally analyzed the pairs where BERT-Multi was unsuccessful and BERTurk was successful. We computed the frequency of each semantic class in the BERT-Multi predictions for these sentences and observed that the neutral class is the most common. This perhaps reflects the fact that neutral is the default choice where the model cannot robustly identify a semantic relation.
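This frequency analysis amounts to counting predicted classes over the examples where one model fails and the other succeeds. A minimal sketch follows; the function name is hypothetical and the label lists are toy data, not predictions from the models in this paper.

```python
from collections import Counter


def predictions_where_only_b_succeeds(gold, preds_a, preds_b):
    """Count model A's predicted classes on examples A gets wrong and B gets right."""
    return Counter(
        a for g, a, b in zip(gold, preds_a, preds_b) if a != g and b == g
    )


# Toy labels (assumption): model A plays the role of BERT-Multi,
# model B the role of BERTurk.
gold = ["contradiction", "entailment", "contradiction", "neutral"]
preds_a = ["neutral", "neutral", "entailment", "neutral"]
preds_b = ["contradiction", "entailment", "contradiction", "entailment"]

counts = predictions_where_only_b_succeeds(gold, preds_a, preds_b)
print(counts.most_common())
```

The last example is excluded because model A is correct there, so only the first three contribute to the counts.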

Conclusion
We created and released the first large Turkish NLI dataset, NLI-TR, by machine translating SNLI and MultiNLI. Though English and Turkish have very different grammars and thus stress-test automatic approaches, our team of experts judged the translations to be of very high quality and to preserve the original NLI labels consistently. These results suggest that MT can help address the paucity of datasets for Turkish NLI. We release code, models, and data publicly for further research.
We also used NLI-TR to investigate central issues in Turkish NLI. First, we used NLI-TR to analyze the effects of in-language pretraining. Second, we compared three morphological parsers for Turkish with simpler tokenization schemes. We found that a Turkish-only pretraining regime can enhance Turkish models significantly, and that morphological parsing is arguably worth its costs only when the training dataset is small. In our final case study, we returned to the general issue of translation quality, but now from the perspective of developing NLI systems. We showed that models trained on MultiNLI-TR perform well on the expert-translated test set from XNLI.
On the basis of these findings, we argue that MT can be more widely adopted for advancing NLP studies on resource-constrained languages. Though language-dependent tasks like dependency parsing are challenging to translate, MT can efficiently transfer large and expensive-to-create labeled datasets from English to other languages in many NLP tasks, including text classification, question answering, and text summarization. In addition, MT will presumably get cheaper, faster, and better over time, thereby further strengthening our core claims.

acknowledges the graduate research scholarship by TÜBİTAK under BİDEB 2211/A program.
The authors gratefully acknowledge that the computational parts of this study have been mostly performed at the Boğaziçi TETAM DGX-1 GPU Cluster and partially carried out at the TÜBİTAK ULAKBİM High Performance and Grid Computing Center (TRUBA resources) and the Stanford Research Computing Center (FarmShare).

SNLI

Premise
English: Several people are on stage preparing for a show.

Entailment
English: People are setting up for a show.
Turkish: İnsanlar bir gösteri için hazırlanıyor.

Contradiction
English: A house is being demolished.
Turkish: Bir ev yıkılıyor.

Neutral
English: A crew is getting ready for a rock concert.

MultiNLI

Premise
English: All rooms have color TV, alarm clock/radio, en-suite bathrooms, real hangers, and shower massage.

Entailment
English: All rooms also contain a ceiling fan and outlets for electronics.

Contradiction
English: You will not find a TV or alarm clock in any of the rooms.

Neutral
English: Color TVs, alarms, and hangers can be found in all rooms.
Turkish: Tüm odalarda renkli TV'ler, alarmlar ve askılar bulunur.

Table 8: Accuracy of the cased models in Table 4 trained on SNLI and MultiNLI. We used the same fine-tuning and evaluation procedures. BERT-En ranked first and BERT-Multi second, once more emphasizing the importance of in-language training, as in Section 4.1.

Table 9: Accuracy results of the models in Table 6 for machine-translated XNLI. The outcomes agree with the ones in Section 4.3, suggesting that machine-translated sentences can be used to evaluate Turkish NLI models. Note that XNLI-Dev-TR, XNLI-Test-TR, and MultiNLI-TR are translated with the same MT service, whereas MultiNLI-TR XNLI used a different one. Though this might result in a positive bias for MultiNLI-TR models, we report the accuracy of MultiNLI-TR XNLI models as well for the sake of completeness.

Table 3. The label "broken" corresponds to the pairs which have either a major translation error or no majority-level label.