Robustness to Modification with Shared Words in Paraphrase Identification

Revealing the robustness issues of natural language processing models and improving their robustness are important for maintaining performance in difficult situations. In this paper, we study the robustness of paraphrase identification models from a new perspective: modification with shared words. We show that the models have significant robustness issues when facing such modifications. To modify an example consisting of a sentence pair, we either replace some words shared by both sentences or introduce new shared words. We aim to construct a valid new example such that a target model makes a wrong prediction. To find a modification solution, we use beam search constrained by heuristic rules, and we leverage a BERT masked language model to generate substitution words compatible with the context. Experiments show that the performance of the target models drops dramatically on the modified examples, thereby revealing the robustness issue. We also show that adversarial training can mitigate this issue.


Introduction
Paraphrase identification is the task of determining whether a pair of sentences have the same meaning (Socher et al., 2011), with many applications such as duplicate question matching on social media (Iyer et al., 2017) and plagiarism detection (Clough, 2000). It can be viewed as a sentence matching problem, and many neural models have achieved strong performance on benchmark datasets (Wang et al., 2017; Gong et al., 2017; Devlin et al., 2018).
Despite this progress, there is little work on the robustness of paraphrase identification models, although natural language processing (NLP) models on other tasks have been shown to be vulnerable and lacking in robustness. In previous work on the robustness of NLP models, a popular approach is to construct semantic-preserving perturbations of input sentences that significantly change the model prediction, in tasks such as text classification and natural language inference (Alzantot et al., 2018; Jin et al., 2019). However, on specific tasks, it is possible to design modifications that are not necessarily semantic-preserving, which can reveal further robustness issues. For instance, on reading comprehension, Jia and Liang (2017) modified examples by inserting distracting sentences into the input paragraphs. Such findings are important for investigating and resolving the weaknesses of NLP models.
On paraphrase identification, to the best of our knowledge, the only previous work is PAWS, with a cross-lingual version (Yang et al., 2019), which found that models often make false positive predictions when the words in the two sentences differ only in order. However, this approach applies to negative examples only; for positive examples, back-translation was used, which still generates semantically similar sentences. Moreover, it was unknown whether models still easily make false positive predictions when the word overlap between the two sentences is much smaller than 100%.
In this paper, we propose an algorithm for studying the robustness of paraphrase identification models from a new perspective: modification with shared words (words that appear in both sentences). For positive examples, i.e., when the two sentences are paraphrases, we aim to see whether models can still make correct predictions after some shared words are replaced. Each pair of selected shared words is replaced with a new word, and the new example tends to remain positive. As the first example in Figure 1 shows, by replacing "purpose" and "life" with "measure" and "value" respectively, the sentences change from asking about the "purpose of life" to the "measure of value" and remain paraphrases, but the target model makes a wrong prediction. This indicates that the target model has a weakness in generalizing from "purpose of life" to "measure of value". On the other hand, for negative examples, we replace some words and introduce new shared words to the two sentences while trying to keep the new example negative. As the second example in Figure 1 shows, with the new shared words "credit" and "score" introduced, the new example remains negative but the target model makes a false positive prediction. This reveals that the target model can be distracted by the shared words while ignoring the differences in the unmodified parts. The unmodified parts of the two sentences must have a low word overlap to reveal such a weakness; in contrast, the examples in PAWS have exactly the same bag of words and therefore cannot support this investigation.

Figure 1: Examples with labels positive and negative respectively, originally from Quora Question Pairs (QQP) (Iyer et al., 2017). "(P)" and "(Q)" are original sentences while "(P')" and "(Q')" are modified. Modified words are highlighted in bold. "Output" indicates the change of output labels by BERT (Devlin et al., 2018), where the percentage numbers are confidence scores.
In our word replacement, to preserve the label and language quality, we impose heuristic constraints on replaceable positions. Furthermore, we apply a BERT masked language model (Devlin et al., 2018) to generate substitution words compatible with the context. We use beam search to find a word replacement solution that approximately maximizes the loss of the target model and thereby tends to make the model fail.
We summarize our contributions below:

• We study the robustness of paraphrase identification models via modification with shared words. Experiments show that models have a severe performance drop on our modified examples, which reveals a robustness issue.
• We propose a novel and concise method that leverages the BERT masked language model for generating substitution words compatible with the context.
• We show that adversarial training with our generated examples can mitigate the robustness issue.
• Compared to previous works, our perspective is new: 1) our modification is not limited to being semantic-preserving; and 2) our negative examples have a much lower word overlap between the two sentences than those in PAWS.
Related Work

Paraphrase Identification Models
There exist many neural models for sentence matching and paraphrase identification. Some works applied a classifier on independently encoded embeddings of the two sentences (Bowman et al., 2015; Yang et al., 2015; Conneau et al., 2017), while others created strong interactions between the two sentences by jointly encoding and matching them (Wang et al., 2017; Duan et al., 2018; Kim et al., 2018) or by hierarchically extracting features from their interaction space (Hu et al., 2014; Pang et al., 2016; Gong et al., 2017). Notably, BERT, pre-trained on large-scale corpora, achieved even better results (Devlin et al., 2018).

Robustness of NLP Models
On the robustness of NLP models, many previous works constructed semantic-preserving perturbations of input sentences (Alzantot et al., 2018; Iyyer et al., 2018; Ribeiro et al., 2018; Hsieh et al., 2019; Jin et al., 2019; Ren et al., 2019). However, NLP models for some tasks have robustness issues beyond semantic-preserving perturbations. In reading comprehension, Jia and Liang (2017) studied the robustness issue when a distractor sentence is added to the paragraph. In natural language inference, Minervini and Riedel (2018) considered logical rules of sentence relations, and Glockner et al. (2018) used single word replacement with lexical knowledge. Thus methods for general NLP tasks alone are insufficient for studying the robustness of specific tasks. In particular, for paraphrase identification, the only prior work is PAWS (Yang et al., 2019), which used word swapping, but this method applies to negative examples only and each constructed pair of sentences has exactly the same bag of words.

Algorithm Framework
Paraphrase identification can be formulated as follows: given two sentences P = p_1 p_2 ... p_n and Q = q_1 q_2 ... q_m, the goal is to predict whether P and Q are paraphrases. The model outputs a score [Z(P, Q)]_ŷ for each class ŷ ∈ Y = {positive, negative}, where positive means P and Q are paraphrases and negative means they are not. We first sample an original example from the dataset and then modify it. We take multiple modification steps until the model fails or the step number limit is reached. In each step, we replace a word pair with a shared word, and we evaluate the different options by the model loss they induce. We use beam search to find approximately optimal options, and the modified example evaluated as the best option is finally returned.
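The overall search loop can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' code: `target_loss` and `candidate_steps` are hypothetical stand-ins for the target model's loss and the per-step modification options, and the early stop when the model already fails is omitted for brevity.

```python
def modify_example(p_tokens, q_tokens, label, target_loss, candidate_steps,
                   beam_size=25, max_steps=5):
    """Beam search over single-step word-pair replacements (a sketch).

    target_loss(p, q, label) -> float: loss of the target model on (p, q, label).
    candidate_steps(p, q, label) -> list of (i, j, w): replace p[i] and q[j]
        with the shared word w.
    Returns the best (loss, p, q) found, maximizing the target model's loss.
    """
    beam = [(target_loss(p_tokens, q_tokens, label), p_tokens, q_tokens)]
    for _ in range(max_steps):
        expanded = []
        for _, p, q in beam:
            for i, j, w in candidate_steps(p, q, label):
                p2 = p[:i] + [w] + p[i + 1:]   # replace p[i] with shared word w
                q2 = q[:j] + [w] + q[j + 1:]   # replace q[j] with the same w
                expanded.append((target_loss(p2, q2, label), p2, q2))
        if not expanded:
            break
        expanded.sort(key=lambda x: -x[0])     # keep options with highest loss
        beam = expanded[:beam_size]
    return beam[0]
```

In practice the loop would also query the model's prediction after each step and stop as soon as the predicted label is wrong, as described above.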
In the remainder of this section, we describe the modification options available to our algorithm and then how to find optimal modification solutions.

Modification Options
Original Example Sampling
To sample an original example from the dataset: for a positive example, we directly sample a positive example (P, Q, positive) from the original data; for a negative example, we sample two different sentence pairs (P_1, Q_1) and (P_2, Q_2) and form a negative example (P_1, Q_2, negative).

Replaceable Position Pairs
For a sentence pair under modification, we impose heuristic rules on replaceable position pairs. First, we do not replace stopwords. Furthermore, for a positive example, we require each replaceable word pair to be shared words, while for a negative example, we only require the two words to be both nouns, both verbs, or both adjectives, according to part-of-speech (POS) tags obtained with the Natural Language Toolkit (NLTK) (Bird et al., 2009). Two examples are shown in Figure 2. For the first example (positive), only the shared words "purpose" and "life" can be replaced, and the two modified sentences are likely to still talk about the same thing as each other, e.g., changing from "purpose of life" to "measure of value"; thereby the new example tends to remain positive. For the second example (negative), the nouns "Gmail", "account", "school", "management" and "software" can be replaced. Consequently, the modified sentences are based on the templates "How can I get · · · back ?" and "What is the best · · · ?", and the pair tends to remain negative even when the templates are filled with shared words. In this way, the labels can usually be preserved.
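The replaceable-position rules above can be sketched as a small filter. This is an illustration, not the authors' code: the paper uses NLTK for POS tagging and a standard stopword list, while here `pos_of` is a stand-in tagger supplied by the caller and `STOPWORDS` is an illustrative subset.

```python
# Illustrative subset of a stopword list; the paper uses a standard one.
STOPWORDS = {"a", "an", "the", "of", "is", "what", "how", "i",
             "can", "get", "to", "from", "for"}

def replaceable_pairs(p, q, label, pos_of):
    """Return (i, j) position pairs eligible for replacement.

    Positive examples: only identical (shared) non-stopword tokens.
    Negative examples: any two non-stopword tokens with the same coarse POS
    among nouns, verbs, and adjectives. pos_of(word) -> coarse POS tag
    (a stand-in for an NLTK tagger).
    """
    pairs = []
    for i, pw in enumerate(p):
        if pw.lower() in STOPWORDS:
            continue
        for j, qw in enumerate(q):
            if qw.lower() in STOPWORDS:
                continue
            if label == "positive":
                if pw.lower() == qw.lower():        # must be a shared word
                    pairs.append((i, j))
            else:
                tag = pos_of(pw)
                if tag == pos_of(qw) and tag in {"NOUN", "VERB", "ADJ"}:
                    pairs.append((i, j))
    return pairs
```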

Substitution Words
We use a pre-trained BERT masked language model (Devlin et al., 2018) to generate substitution words compatible with the context, for each replaceable position pair. Specifically, to replace words p_i and q_j from the two sentences with some shared word w, we compute a joint probability distribution

P(w | p_{1:i−1}, p_{i+1:n}, q_{1:j−1}, q_{j+1:m}) = P(w | p_{1:i−1}, p_{i+1:n}) · P(w | q_{1:j−1}, q_{j+1:m}),

where s_{i:j} denotes the subsequence from position i to position j. P(w | p_{1:i−1}, p_{i+1:n}) and P(w | q_{1:j−1}, q_{j+1:m}) are obtained from the language model by masking p_i and q_j respectively. We rank all the words in the vocabulary of the model and choose the top K words with the largest probabilities as the candidate substitution words for the position pair.
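Since the joint score is a product of the two masked-LM distributions, candidate generation reduces to ranking the shared vocabulary by that product. A minimal sketch of the ranking step follows; in practice `dist_p` and `dist_q` would be the distributions produced by a BERT masked language model after masking p_i and q_j, whereas here they are plain dictionaries standing in for those outputs.

```python
def topk_shared_substitutions(dist_p, dist_q, k):
    """Rank candidate shared substitution words.

    dist_p, dist_q: dicts mapping vocabulary words to probabilities,
    i.e. P(w | P context with p_i masked) and P(w | Q context with q_j masked).
    Returns the top-k words by the product of the two probabilities.
    """
    vocab = set(dist_p) & set(dist_q)  # words scored under both contexts
    ranked = sorted(vocab, key=lambda w: dist_p[w] * dist_q[w], reverse=True)
    return ranked[:k]
```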
This method of generating substitution words finds possible substitution words and verifies their compatibility with the context simultaneously. In contrast, previous methods performed these two steps separately (Alzantot et al., 2018; Jin et al., 2019): they first constructed a candidate substitution word list from synonyms, and then, for each substitution word, checked the language quality or semantic similarity constraints of the new sentence. Moreover, some recent works (Li et al., 2020; Garg and Ramakrishnan, 2020) that appeared after our preprint have shown that using a masked language model to substitute words can outperform state-of-the-art methods for generating adversarial examples on text classification and natural language inference tasks.

Finding Modification Solutions
We then use beam search with beam size B to find a modification solution in multiple steps. Each step has two stages that determine the replaced positions and the substitution words respectively, following a two-stage framework (Yang et al., 2018). If the label predicted by the model for the top-ranked example in the beam is already incorrect, we finish the modification process. Otherwise, we take more steps until the model fails or the step number limit S is reached.
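One step of this two-stage procedure might be sketched as follows. All helper names are assumptions for illustration: `score` returns the target model's loss, `words_for` returns the candidate shared words from the masked language model, and the stage-1 proxy (scoring a position pair by deleting the two words) is an illustrative choice, not necessarily the authors' exact criterion.

```python
def two_stage_step(p, q, positions, words_for, score, beam_size):
    """One beam-search step: shortlist positions, then pick words.

    positions: candidate (i, j) replaceable position pairs.
    words_for((i, j)) -> candidate shared substitution words for that pair.
    score(p, q) -> target model loss to maximize.
    Returns up to beam_size (loss, p, q) options, best first.
    """
    # Stage 1: cheaply score each position pair, here by deleting the words.
    def position_proxy(pair):
        i, j = pair
        return score(p[:i] + p[i + 1:], q[:j] + q[j + 1:])

    shortlisted = sorted(positions, key=position_proxy, reverse=True)[:beam_size]

    # Stage 2: evaluate concrete substitution words at shortlisted positions.
    options = []
    for i, j in shortlisted:
        for w in words_for((i, j)):
            p2 = p[:i] + [w] + p[i + 1:]
            q2 = q[:j] + [w] + q[j + 1:]
            options.append((score(p2, q2), p2, q2))
    options.sort(key=lambda x: -x[0])
    return options[:beam_size]
```

The two-stage split keeps the number of full model evaluations manageable: positions are pruned before the (much larger) set of candidate words is considered.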

Datasets and Target Models
We adopt two datasets.
The Quora Question Pairs (QQP) dataset (Iyer et al., 2017) consists of 384,348/10,000/10,000 question pairs in the training/development/test sets; we follow the partition in Wang et al. (2017). The Microsoft Research Paraphrase Corpus (MRPC) (Dolan and Brockett, 2005) consists of sentence pairs from news articles, with 4,076/1,725 pairs in the training/test sets. Each sentence pair is annotated with a label indicating whether the two sentences are paraphrases (positive) or not (negative).
We study three typical models for paraphrase identification.
BiMPM (Wang et al., 2017) matches two sentences from multiple perspectives using BiLSTM layers. DIIN (Gong et al., 2017) adopts DenseNet (Huang et al., 2017) to extract interaction features. BERT (Devlin et al., 2018) is a pre-trained encoder fine-tuned on this task with a classifier applied on encoded representations. These models are representative in terms of backbone neural architectures: BiMPM is based on recurrent neural networks, DIIN on convolutional neural networks, and BERT on Transformers.

Performance on Modified Examples
We train each model on the original training set and then construct modifications that attempt to make the models fail. For each dataset, we sample 1,000 original examples with balanced labels from the test set, and we modify them for each model. We then evaluate the accuracy of the models on our modified examples. Table 1 shows the results; in this section, we focus on the rows with "normal" in the "training" column. The models have high overall accuracies on the original data, but their performance drops dramatically on our modified examples (e.g., the overall accuracy of BERT on QQP drops from 94.3% to 24.1%). This demonstrates that the models indeed have the robustness issue we aim to reveal. Some examples are provided in Appendix B.

Adversarial Training
To improve model robustness, we further fine-tune the models using adversarial training: we generate modified examples with the current model as the target and update the model parameters iteratively. The beam size for generation is set to 1 to reduce the computational cost. We evaluate the adversarially trained models as shown in Table 1 (rows with "adversarial" in the "training" column). The performance of all the models on modified examples rises significantly (e.g., the overall accuracy of BERT on modified examples rises from 24.1% to 66.0% on QQP and from 23.8% to 87.0% on MRPC). This demonstrates that adversarial training with our modified examples can significantly improve robustness without notably hurting the performance on the original data. An improvement on the original data is not expected, since the original data cannot reflect robustness, and a small drop is common in previous works (Jia and Liang, 2017; Iyyer et al., 2018; Ribeiro et al., 2018).

We also manually verify the quality of the modified examples in terms of label correctness and grammaticality. For each dataset, using BERT as the target, we randomly sample 100 modified examples with balanced labels such that the model makes wrong predictions, and we present each of them to three workers on Amazon Mechanical Turk. We ask the workers to label the examples and rate the grammaticality of the sentences on a scale of 1/2/3. We aggregate annotations from different workers by majority voting for labels and averaging for grammaticality. Results are shown in Table 2.
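The adversarial fine-tuning loop can be sketched abstractly as below. This is a hedged sketch of the procedure described here, not the authors' code: `attack` stands for the modification algorithm run against the current model (with beam size 1), and `update` stands for one gradient step; both are hypothetical callables.

```python
def adversarial_finetune(params, clean_batches, attack, update, epochs=2):
    """Iteratively fine-tune on clean and freshly generated modified examples.

    params: current model parameters (opaque to this sketch).
    attack(params, batch) -> modified examples generated with the current
        model as the target (beam size 1 for efficiency).
    update(params, batch) -> new parameters after one training step.
    """
    for _ in range(epochs):
        for batch in clean_batches:
            # Regenerate modified examples against the *current* model, so
            # the adversarial data tracks the model as it changes.
            adv_batch = attack(params, batch)
            params = update(params, batch)      # step on clean examples
            params = update(params, adv_batch)  # step on modified examples
    return params
```

The key design point mirrored from the text is that modified examples are regenerated against the current model rather than fixed in advance, so each iteration targets the model's current weaknesses.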

Conclusion
In this paper, we present a novel algorithm for studying the robustness of paraphrase identification models. We show that the target models have a robustness issue when facing modification with shared words. Such modification is substantially different from those in previous works: it is not semantic-preserving, and each pair of modified sentences generally has a much lower word overlap, thereby revealing a new robustness issue. We also show that model robustness can be improved by adversarial training with our modified examples.

A Implementation Details
We adopt open-source implementations of BiMPM, DIIN and BERT (BERT base is used), and the QQP and MRPC datasets are downloaded from the internet. There are 1.4, 42.8, and 109.5 million parameters in BiMPM, DIIN and BERT respectively.
For QQP, the step number limit of modification, S, is set to 5; the number of candidate substitution words suggested by the language model, K, and the beam size B are both set to 25. S, K and B are doubled for MRPC where sentences are generally longer.
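For reference, the hyperparameters above can be collected in one place. The values are taken directly from this appendix; the dictionary layout itself is just an illustrative convention.

```python
# Hyperparameters from Appendix A. S = step number limit of modification,
# K = number of candidate substitution words from the language model,
# B = beam size. MRPC doubles S, K and B because its sentences are longer.
HPARAMS = {
    "QQP":  {"S": 5,  "K": 25, "B": 25},
    "MRPC": {"S": 10, "K": 50, "B": 50},
}
```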
B Examples of Our Modifications

We show some examples where our modification with shared words makes the target model fail, to further illustrate the robustness issue we reveal. Table 3 presents two examples using BERT as the target model on QQP. For the first example (positive), changing from asking about "lose weight" to "buy anything" fools the target model into altering the predicted label, though the modified sentences are still asking about the same thing and are paraphrases. For the second example (negative), introducing the new shared words "local", "global", "interactions", "existence" and "plane" fools the target model into predicting that the modified sentences are paraphrases, although the new sentences are still asking about different things. Similarly, Table 4 presents two examples on MRPC.

[Table 3: Modified examples for BERT as the target model on QQP. "(P)" and "(Q)" indicate original sentences, and "(P')" and "(Q')" indicate modified sentences. Modified words are highlighted in bold.]

Table 4: Modified examples for BERT as the target model on MRPC.

Example 1 (Label: Positive; Output: Positive → Negative)
(P) The spacecraft is scheduled to blast off as early as tomorrow or as late as Friday from the Jiuquan launching site in the Gobi Desert .
(Q) The spacecraft is scheduled to blast off between next Wednesday and Friday from a launching site in the Gobi Desert .
(P') The match is scheduled to kick off as early as tomorrow or as late as Friday from the Jiuquan long day in the hot summer .
(Q') The match is scheduled to kick off between next Wednesday and Friday from a long day in the hot summer .

Example 2 (Label: Negative; Output: Negative → Positive)
(P) The resolution was approved with no debate by delegates at the bar association 's annual meeting here .
(Q) Morales , who pleaded guilty in July , expressed " sincere regret and remorse " for his crimes .
(P') The loss was approved with no surprise by delegates at the bar association 's annual meeting here .
(Q') Morales , who pleaded guilty in July , expressed " sincere regret and surprise " for his loss .