PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification

Most existing work on adversarial data generation focuses on English. For example, PAWS (Paraphrase Adversaries from Word Scrambling) consists of challenging English paraphrase identification pairs from Wikipedia and Quora. We remedy this gap with PAWS-X, a new dataset of 23,659 human translated PAWS evaluation pairs in six typologically distinct languages: French, Spanish, German, Chinese, Japanese, and Korean. We provide baseline numbers for three models with different capacity to capture non-local context and sentence structure, and using different multilingual training and evaluation regimes. Multilingual BERT fine-tuned on PAWS English plus machine-translated data performs the best, with a range of 83.1-90.8 accuracy across the non-English languages and an average accuracy gain of 23% over the next best model. PAWS-X shows the effectiveness of deep, multilingual pre-training while also leaving considerable headroom as a new challenge to drive multilingual research that better captures structure and contextual information.


Introduction
Adversarial examples have effectively highlighted the deficiencies of state-of-the-art models for many natural language processing tasks, e.g. question answering (Jia and Liang, 2017; Chen et al., 2018; Ribeiro et al., 2018), textual entailment (Zhao et al., 2018; Glockner et al., 2018), and text classification (Alzantot et al., 2018; Iyyer et al., 2018). Zhang et al. (2019) introduce PAWS, which has adversarial paraphrase identification pairs with high lexical overlap, like "flights from New York to Florida" and "flights from Florida to New York". Such pairs stress the importance of modeling sentence structure and context because they have a high word overlap ratio but different meanings. In addition to revealing failures of state-of-the-art models, research on adversarial examples has generally shown that augmenting training data with good adversarial examples can boost performance for some models, providing greater clarity to the modeling landscape as well as new headroom for further improvements.
Most previous work focuses only on English despite the fact that the problems highlighted by adversarial examples are shared by other languages. Existing multilingual datasets for paraphrase identification, e.g. Multi30k (Elliott et al., 2016) and Opusparcus (Creutz, 2018), lack challenging examples like those in PAWS. The lack of high-quality adversarial examples in other languages makes it difficult to benchmark model improvements. We bridge this gap by creating Cross-lingual PAWS (PAWS-X), an extension of the Wikipedia portion of the PAWS evaluation and test examples to six languages: Spanish, French, German, Chinese, Japanese, and Korean. This new corpus consists of 23,659 human translated example pairs with paraphrase judgments in each target language. Like previous work on multilingual corpus creation (Conneau et al., 2018), we machine translate the original PAWS English training set (49,401 pairs). Note that all translated pairs still have high word overlap and inherit their semantic similarity labels from the original PAWS examples; thus, the resulting dataset preserves the ability to probe models' sensitivity to structure and context. We also machine translate the evaluation pairs of each language into English to establish the baseline performance of a translate-then-predict strategy. The PAWS-X dataset, including both the new human translated pairs and the machine translated examples, is available for download at https://github.com/google-research-datasets/paws.
Our experiments show that PAWS-X effectively measures the multilingual adaptability of models and how well they capture context and word order. The state-of-the-art multilingual BERT model (Devlin et al., 2019) obtains a 32% (absolute) accuracy improvement over a bag-of-words model. We also show that machine translation helps and works better than a zero-shot strategy. We find that performance on German, French, and Spanish is overall better than on Chinese, Japanese, and Korean.

PAWS-X Corpus
The core of our corpus creation procedure is to translate the Wikipedia portion of the original PAWS corpus from English (en) to six languages: French (fr), Spanish (es), German (de), Chinese (zh), Japanese (ja), and Korean (ko). To this end, we hire human translators to translate the development and test sets, and use a neural machine translation (NMT) service to translate the training set.
We choose translation instead of repeating the PAWS data generation approach (Zhang et al., 2019) for other languages. This has at least three advantages. First, human translation does not require high-quality multilingual part-of-speech taggers or named entity recognizers, which play a key role in the data generation process used in Zhang et al. (2019). Second, human translators are trained to produce the target sentence while preserving meaning, thereby ensuring high data quality. Third, the resulting data can provide a new testbed for cross-lingual transfer techniques because examples in all languages are translated from the same sources. For example, PAWS-X could be used to evaluate whether a German or French sentence is a paraphrase of a Chinese or Japanese one.
Translating Evaluation Sets We obtain human translations of a random sample of 4,000 sentence pairs from the PAWS development set for each of the six languages (48,000 translations). The manual translation is performed by 10-20 in-house professionals who are native speakers of each language. A randomly sampled subset is presented to and validated by a second worker. The final delivery is guaranteed to have a word-level error rate below 5%. The sampled 4,000 pairs are split into new development and test sets of 2,000 pairs each.
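The sampling and split described above can be sketched as follows. This is only an illustration: the pool size, the fixed seed, and the function name `sample_and_split` are assumptions made for the example, not details of the actual procedure.

```python
import random


def sample_and_split(pair_ids, sample_size=4000, dev_size=2000, seed=0):
    """Randomly sample pair ids, then split the sample into dev and test parts."""
    rng = random.Random(seed)  # fixed seed is an assumption, for reproducibility only
    sampled = rng.sample(list(pair_ids), sample_size)
    return sampled[:dev_size], sampled[dev_size:]


# Hypothetical ids standing in for the pairs of the PAWS development set.
dev_ids, test_ids = sample_and_split(range(8000))

# Each pair has two sentences, and each sentence is translated into six languages.
num_translations = (len(dev_ids) + len(test_ids)) * 2 * 6
```

With 4,000 sampled pairs, two sentences per pair, and six target languages, `num_translations` matches the 48,000 translations reported above.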
Due to time and cost constraints, we could not translate all 16,000 examples in the original PAWS development and test sets. Each sentence in a pair is presented independently so that translation is not affected by context. In our initial studies we noticed that it was sometimes difficult to translate an entity mention. We therefore ask translators to translate entity mentions, but different translators may have different preferences according to their background knowledge. Table 1 gives example translated pairs in each language.
Resulting Corpus Some sentences could not be translated. Table 2 shows the final counts translated to each language. Most of the untranslated sentences were incomplete or ambiguous, such as "It said that Easipower was," and "Park Green took over No." These sentences are likely artifacts of the adversarial generation process used to create PAWS. On average, less than 2% of the pairs are not translated, and we simply exclude them.
The authors further verified translation quality for a random sample of ten pairs in each language. PAWS-X includes 23,659 human-translated pairs, with 11,815 and 11,844 pairs in development and test, respectively. Finally, the original PAWS labels (paraphrase or not paraphrase) are mapped to the translations. Positive pairs account for 44.0% of the development sets and 45.4% of the test sets, close to the PAWS label distribution.
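A minimal sketch of the exclusion and label-mapping steps is given below. The data layout (dictionaries keyed by pair id, `None` for a sentence that could not be translated) and the function names are assumptions made for illustration; the released files may be organized differently.

```python
def inherit_labels(english_labels, translated_pairs):
    """Attach the original English label to each fully translated pair.

    english_labels: dict mapping pair id -> 1 (paraphrase) or 0 (not paraphrase).
    translated_pairs: dict mapping pair id -> (sentence1, sentence2), where an
        untranslatable sentence is represented as None (an assumption).
    """
    labeled = []
    for pair_id, (s1, s2) in translated_pairs.items():
        if s1 is None or s2 is None:
            continue  # pairs with an untranslated sentence are excluded
        labeled.append((pair_id, s1, s2, english_labels[pair_id]))
    return labeled


def positive_fraction(labeled):
    """Share of pairs labeled as paraphrases."""
    return sum(label for *_, label in labeled) / len(labeled)


# Toy, hypothetical data: pair 2 has one sentence that could not be translated.
toy_labels = {1: 1, 2: 0, 3: 0, 4: 1}
toy_translations = {
    1: ("s1 text", "s2 text"),
    2: ("s1 text", None),
    3: ("s1 text", "s2 text"),
    4: ("s1 text", "s2 text"),
}
toy_labeled = inherit_labels(toy_labels, toy_translations)
```

Because labels are copied rather than re-annotated, the label distribution of each language can differ from the English one only through the excluded pairs.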
Translation brings new challenges to the paraphrase identification task. An entity can be translated differently, such as Slovak and Slowake (Table 1), and models need to capture that these refer to the same entity. In a more challenging example, Four Rivers, Audubon and Shawnee Trails are translated in just one of the sentences:
en s1 From the merger of the Four Rivers Council and the Audubon Council, the Shawnee Trails Council was born.
en s2 Shawnee Trails Council was formed from the merger of the Four Rivers Council and the Audubon Council.
In the zh-s2 example, the parentheses give English glosses of Chinese entity mentions.

Evaluated Methods
The goal of PAWS-X is to probe models' ability to capture structure and context in a multilingual setting. We consider three models with varied complexity and expressiveness. The first baseline is a simple bag-of-words (BOW) encoder with cosine similarity. It uses unigram and bigram token encodings as input features and labels a pair as a paraphrase when the cosine value is above 0.5. The second model is ESIM, the Enhanced Sequential Inference Model (Chen et al., 2017). Multilingual BERT is a single model trained on 104 languages, which enables experiments with cross-lingual training regimes. (1) Zero Shot: the model is trained on the PAWS English training data, and then directly evaluated on all other languages. Machine translation is not involved in this strategy.
(2) Merged: train a multilingual model on all languages, including the original English pairs and machine-translated data in all other languages. Table 3 summarizes the models with respect to whether they represent non-local contexts or support cross-sentential word interaction, plus which strategies are evaluated for each model.
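To make the bag-of-words decision rule concrete, the sketch below scores a pair by the cosine similarity of its unigram and bigram features and applies the 0.5 threshold. It is a simplification under stated assumptions: it uses raw n-gram counts and whitespace tokenization instead of the trained embedding-based encoder used in the experiments, and all function names are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(sentence, max_n=2):
    """Count unigrams and bigrams; whitespace tokenization is an assumption."""
    tokens = sentence.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(value * b[key] for key, value in a.items() if key in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def bow_is_paraphrase(s1, s2, threshold=0.5):
    """Label the pair as a paraphrase when the cosine value exceeds the threshold."""
    return cosine(ngram_counts(s1), ngram_counts(s2)) > threshold


s1 = "flights from New York to Florida"
s2 = "flights from Florida to New York"
score = cosine(ngram_counts(s1), ngram_counts(s2))
```

For this pair the two sentences share all six unigrams and two of five bigrams, so the score is 8/11 (about 0.73) and the rule labels them as paraphrases even though their meanings differ. This is the kind of failure that high-overlap adversarial pairs are designed to expose.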

Experiments and Results
We use the latest public multilingual BERT base model with 12 layers and apply the default fine-tuning strategy with batch size 32 and learning rate 1e-5. For BOW and ESIM, we use our own implementations and 300-dimensional multilingual word embeddings from fastText. We allow fine-tuning of word embeddings during training, which gives better empirical performance.
We use two metrics: classification accuracy and area-under-curve scores of precision-recall curves (AUC-PR). For BERT, probability scores for the positive class are used to compute AUC-PR. For BOW and ESIM, a cosine threshold of 0.5 is used to compute accuracy. In all experiments, we choose the best model checkpoint based on accuracy on the development sets and report results on the test sets.
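The two metrics can be sketched in a few lines. The text does not specify how the area under the precision-recall curve is estimated, so the example below uses average precision, a common step-wise estimate, as an assumption; ties between scores are not treated specially, and the function names are hypothetical.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of pairs whose thresholded score matches the gold label."""
    predictions = [1 if s > threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def average_precision(labels, scores):
    """Step-wise estimate of the area under the precision-recall curve."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total_positives = sum(labels)
    if total_positives == 0:
        return 0.0
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank  # precision at each recalled positive
    return total / total_positives


# Toy gold labels and positive-class scores (hypothetical values).
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.1]
toy_accuracy = accuracy(toy_labels, toy_scores)
toy_auc_pr = average_precision(toy_labels, toy_scores)
```

On the toy data, accuracy is 0.75 (the second pair is a false positive at threshold 0.5) and average precision is (1/1 + 2/3) / 2, about 0.83. Unlike accuracy, the ranking-based score does not depend on the choice of threshold.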
Results Table 4 shows the performance of all methods and languages. Table 5 summarizes the average results for the six non-English languages.
Model Comparisons: On both Translate Train and Translate Test, BERT consistently outperforms both BOW and ESIM by a substantial margin (>15% absolute accuracy gains) across all seven languages. BERT Translate Train achieves an average 20% accuracy gain. This result demonstrates that PAWS-X effectively measures models' sensitivity to word order and syntactic structure.
Training/Evaluation Strategies: As Tables 4 and 5 show, the Zero Shot strategy yields the lowest performance compared to other strategies on BERT. This is evidence that machine-translated data helps in the multilingual scenario. Indeed, when training on machine-translated examples in all languages (Merged), the model achieves the best performance, with 8.6% accuracy and 7.1% AUC-PR average gains over Zero Shot.
Multilingual BERT is pre-trained on over one hundred languages; hence BERT provides better initialization for non-English languages than ESIM (which relies on fastText embeddings). The gap between training on English and on other languages is therefore smaller on BERT than on ESIM, which makes Translate Train work better on BERT.
Language Difference: Across all models and approaches, performance on Indo-European languages (German, French, Spanish) is consistently better than on CJK languages (Chinese, Japanese, Korean). The performance difference is particularly noticeable on Zero Shot. This can be explained from two perspectives. First, the MT system we used works better on Indo-European languages than on CJK. Second, the CJK languages are more typologically and syntactically different from English. For example, in Table 1, Slowake in German is much closer to the original English term Slovak than its Chinese translation 斯洛伐克 is. This at least partly explains why performance on CJK is particularly poor in Zero Shot.
Error Analysis: To gauge the difficulty of each example for the best model (BERT-merged), Table 6 shows the count of test-set examples grouped by the number of languages in which the same pair is assigned the correct label. The majority of the examples are easy, with 61.7% correct in all languages. Of the 32 examples that failed in all languages, most are hard or highly ambiguous. Some have incorrect gold labels or were generated incorrectly in the original PAWS data.
The following is a sample of these.
b2 He eventually established himself in northwestern Italy, apparently supported by Guy, where he probably received the title of "comes". (not match)
We also considered examples that are correctly predicted in just half of the languages. Some of these failed because of translation noise, e.g. inconsistent entity translations (as shown in §2).
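The per-pair breakdown used in this analysis amounts to counting, for each test pair, the number of languages in which the model predicts the gold label. A minimal sketch follows; the nested-dictionary layout and the function name are assumptions, and a pair that is missing in a language is counted as not correct there, which is also an assumption.

```python
from collections import Counter


def correct_language_histogram(predictions, gold):
    """Map 'number of languages predicted correctly' to 'number of pairs'.

    predictions: dict mapping language code -> {pair id: predicted label}.
    gold: dict mapping pair id -> gold label (shared across languages).
    """
    histogram = Counter()
    for pair_id, label in gold.items():
        n_correct = sum(
            1 for lang_preds in predictions.values() if lang_preds.get(pair_id) == label
        )
        histogram[n_correct] += 1
    return histogram


# Toy, hypothetical predictions for two languages and three pairs.
toy_gold = {1: 1, 2: 0, 3: 1}
toy_predictions = {
    "de": {1: 1, 2: 0, 3: 0},
    "zh": {1: 1, 2: 1, 3: 0},
}
toy_histogram = correct_language_histogram(toy_predictions, toy_gold)
```

In the toy data, pair 1 is correct in both languages, pair 2 in one, and pair 3 in none. Pairs in the zero bucket are the natural candidates for manual inspection, as done for the 32 examples above.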

Conclusion
We introduce PAWS-X, a challenging paraphrase identification dataset with 23,659 human translated evaluation pairs in six languages. Our experimental results showed that PAWS-X effectively measures sensitivity of models to word order and the efficacy of cross-lingual learning approaches. It also leaves considerable headroom as a new challenging benchmark to drive multilingual research on the problem of paraphrase identification.