Double Perturbation: On the Robustness of Robustness and Counterfactual Bias Evaluation

Robustness and counterfactual bias are usually evaluated on a test dataset. However, are these evaluations robust? If the test dataset is perturbed slightly, will the evaluation results remain the same? In this paper, we propose a “double perturbation” framework to uncover model weaknesses beyond the test dataset. The framework first perturbs the test dataset to construct abundant natural sentences similar to the test data, and then diagnoses the prediction change regarding a single-word substitution. We apply this framework to study two perturbation-based approaches that are used to analyze models’ robustness and counterfactual bias in English. (1) For robustness, we focus on synonym substitutions and identify vulnerable examples where the prediction can be altered. Our proposed attack attains high success rates (96.0%-99.8%) in finding vulnerable examples on both original and robustly trained CNNs and Transformers. (2) For counterfactual bias, we focus on substituting demographic tokens (e.g., gender, race) and measure the shift of the expected prediction among constructed sentences. Our method is able to reveal hidden model biases not directly shown in the test dataset. Our code is available at https://github.com/chong-z/nlp-second-order-attack.

x 0 ="a deep and meaningful film (movie)." X test x 0 ="a short and moving film (movie)." Numbers on the bottom right are the sentiment predictions for film and movie. Blue x 0 comes from the test dataset and its prediction cannot be altered by the substitution film → movie (robust). Yellow examplex 0 is slightly perturbed but remains natural. Its prediction can be altered by the substitution (vulnerable).
In most studies, model robustness is evaluated on a given test dataset or on synthetic sentences constructed from templates (Ribeiro et al., 2020). Specifically, the robustness of a model is often evaluated by the ratio of test examples on which the model prediction cannot be altered by a semantic-invariant perturbation. We refer to this type of evaluation as first-order robustness evaluation. However, even if a model is first-order robust on an input sentence x_0, it is possible that the model is not robust on a natural sentence x̃_0 that is slightly modified from x_0. In that case, adversarial examples still exist even if first-order attacks cannot find any of them from the given test dataset. Throughout this paper, we call x̃_0 a vulnerable example. The existence of such examples exposes weaknesses in models' understanding and presents challenges for model deployment. Fig. 1 illustrates an example.
In this paper, we propose the double perturbation framework for evaluating a stronger notion of second-order robustness. Given a test dataset, we consider a model to be second-order robust if no vulnerable example can be identified in the neighborhood of the given test instances (§2.2). In particular, our framework first perturbs the test set to construct the neighborhood, and then diagnoses the robustness regarding a single-word synonym substitution. Taking Fig. 2 as an example, the model is first-order robust on the input sentence x_0 (the prediction cannot be altered), but it is not second-order robust due to the existence of the vulnerable example x̃_0. Our framework is designed to identify x̃_0.
We apply the proposed framework and quantify second-order robustness through two second-order attacks ( §3). We experiment with English sentiment classification on the SST-2 dataset (Socher et al., 2013) across various model architectures. Surprisingly, although robustly trained CNN (Jia et al., 2019) and Transformer (Xu et al., 2020) can achieve high robustness under strong attacks (Alzantot et al., 2018; Garg and Ramakrishnan, 2020) (23.0%-71.6% success rates), for around 96.0% of the test examples, our attacks can find a vulnerable example by perturbing 1.3 words on average. This finding indicates that these robustly trained models, despite being first-order robust, are not second-order robust.
Furthermore, we extend the double perturbation framework to evaluate counterfactual biases (Kusner et al., 2017) ( §4) in English. When the test dataset is small, our framework can improve the robustness of the evaluation by revealing hidden biases not directly shown in the test dataset. Intuitively, a fair model should make the same prediction for nearly identical examples that reference different groups (Garg et al., 2019) with different protected attributes (e.g., gender, race). In our evaluation, we consider a model biased if substituting tokens associated with protected attributes changes the expected prediction, which is the average prediction among all examples within the neighborhood. For instance, a toxicity classifier is biased if it tends to predict higher toxicity when we substitute straight → gay in an input sentence (Dixon et al., 2018). In the experiments, we evaluate the expected sentiment predictions on pairs of protected tokens (e.g., (he, she), (gay, straight)), and demonstrate that our method is able to reveal hidden model biases.
Figure 2: An illustration of the decision boundary. The diamond area denotes invariance transformations. Blue x_0 is a robust input example (the entire diamond is green). Yellow x̃_0 is a vulnerable example in the neighborhood of x_0. Red x̃′_0 is an adversarial example to x̃_0. Note: x̃′_0 is not an adversarial example to x_0 since they have different meanings to a human (outside the diamond).

Our main contributions are: (1) We propose the double perturbation framework to diagnose the robustness of existing robustness and fairness evaluation methods. (2) We propose two second-order attacks to quantify the stronger notion of second-order robustness and reveal model vulnerabilities that cannot be identified by previous attacks.
(3) We propose a counterfactual bias evaluation method to reveal the hidden model bias based on our double perturbation framework.

The Double Perturbation Framework
In this section, we describe the double perturbation framework, which focuses on identifying vulnerable examples within a small neighborhood of the test dataset. The framework consists of a neighborhood perturbation and a word substitution. We start by defining word substitutions.

Existing Word Substitution Strategy
We focus our study on word-level substitution, where existing works evaluate robustness and counterfactual bias by directly perturbing the test dataset. For instance, adversarial attacks alter the prediction by making synonym substitutions, and the fairness literature evaluates counterfactual fairness by substituting protected tokens. We integrate the word substitution strategy into our framework as the component for evaluating robustness and fairness.
For simplicity, we consider a single-word substitution and denote it with the operator ⊕. Let X ⊆ V^l be the input space, where V is the vocabulary and l is the sentence length. Let p = (p^(1), p^(2)) ∈ V^2 be a pair of synonyms (called patch words), let X_p ⊆ X denote sentences with a single occurrence of p^(1) (for simplicity we skip other sentences), and let x_0 ∈ X_p be an input sentence. Then x_0 ⊕ p means "substitute p^(1) → p^(2) in x_0", i.e., the result after substitution is x_0 with its single occurrence of p^(1) replaced by p^(2). Taking Fig. 1 as an example, where p = (film, movie) and x_0 = "a deep and meaningful film", the perturbed sentence is x_0 ⊕ p = "a deep and meaningful movie". Now we introduce the other components in our framework.
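As a concrete illustration, the snippet below is a minimal sketch of the ⊕ operator on whitespace-tokenized sentences. The helper name `substitute` is ours, not from the authors' released code, and the paper restricts X_p to sentences containing exactly one occurrence of p^(1).

```python
def substitute(x0: str, p: tuple) -> str:
    """Return x0 ⊕ p: replace the occurrence of p[0] in x0 with p[1]."""
    tokens = x0.split()
    i = tokens.index(p[0])                    # raises ValueError if p[0] is absent
    return " ".join(tokens[:i] + [p[1]] + tokens[i + 1:])

# Example from Fig. 1:
# substitute("a deep and meaningful film .", ("film", "movie"))
# -> "a deep and meaningful movie ."
```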

Proposed Neighborhood Perturbation
Instead of applying the aforementioned word substitutions directly to the original test dataset, our framework perturbs the test dataset within a small neighborhood to construct similar natural sentences. This is to identify vulnerable examples with respect to the model. Note that examples in the neighborhood are not required to have the same meaning as the original example, since we only study the prediction difference caused by applying synonym substitution p ( §2.1).
Constraints on the neighborhood. We limit the neighborhood sentences within a small ℓ_0 norm ball (around the test instance) to ensure syntactic similarity, and empirically ensure naturalness through a language model. The neighborhood of an input sentence x_0 ∈ X is

Neighbor_k(x_0) = Ball_k(x_0) ∩ X_natural,

where Ball_k(x_0) = {x | ‖x − x_0‖_0 ≤ k, x ∈ X} is the ℓ_0 norm ball around x_0 (i.e., at most k different tokens), and X_natural denotes natural sentences that satisfy a certain language model score, which will be discussed next.
Construction with masked language model. We construct neighborhood sentences from x_0 by substituting at most k tokens. As shown in Algorithm 1, the construction employs a recursive approach and replaces one token at a time. For each recursion, the algorithm first masks each token of the input sentence (either the original x_0 or the x̃ from the last recursion) separately and predicts likely replacements with a masked language model (e.g., DistilBERT, Sanh et al. 2019). To ensure naturalness, we keep, for each mask, the top 20 tokens with the largest logits (subject to a threshold, Line 9). Then, the algorithm constructs neighborhood sentences by replacing the mask with the found tokens. We use the notation x̃ in the following sections to denote the constructed sentences within the neighborhood.

Algorithm 1: Neighborhood construction
Data: input sentence x_0, masked language model LM, max distance k.
Function Neighbor_k(x_0):
    For each position i, mask the i-th token and query LM for candidate tokens and their logits.
    Construct new sentences by replacing the i-th token with each retained candidate.
    Recurse on the constructed sentences until at most k tokens have been replaced.
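The following is a minimal sketch of Algorithm 1 using the Hugging Face transformers fill-mask pipeline with DistilBERT. The candidate count, probability threshold, and whitespace tokenization are illustrative simplifications rather than the authors' exact settings (the paper thresholds MLM logits and keeps the top 20 tokens per mask).

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="distilbert-base-uncased")

def neighbor_k(sentence: str, k: int, top_n: int = 20, min_score: float = 1e-3) -> set:
    """Recursively construct neighborhood sentences by replacing one token at a time."""
    if k == 0:
        return {sentence}
    tokens = sentence.split()
    neighbors = set()
    for i in range(len(tokens)):
        # Mask the i-th token and query the masked language model for candidates.
        masked = " ".join(tokens[:i] + [fill_mask.tokenizer.mask_token] + tokens[i + 1:])
        for cand in fill_mask(masked, top_k=top_n):
            if cand["score"] < min_score:     # crude naturalness threshold
                continue
            new_tok = cand["token_str"].strip()
            neighbors.add(" ".join(tokens[:i] + [new_tok] + tokens[i + 1:]))
    # Recurse so that up to k tokens differ from the original sentence.
    result = {sentence} | neighbors
    for s in neighbors:
        result |= neighbor_k(s, k - 1, top_n, min_score)
    return result

# Example: neighbor_k("a deep and meaningful film .", k=1)
```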

Evaluating Second-Order Robustness
With the proposed double perturbation framework, we design two black-box attacks to identify vulnerable examples within the neighborhood of the test set. We aim to evaluate robustness for inputs beyond the test set.

Previous First-Order Attacks
Adversarial attacks search for small, semantic-invariant perturbations of the model input that can alter the prediction. To simplify the discussion, in the following we take a binary classifier f(x) : X → {0, 1} as an example to describe our framework. Let x_0 be a sentence from the test set with label y_0. Then the smallest perturbation δ* under the ℓ_0 norm distance is

δ* = argmin_δ ‖δ‖_0  subject to  f(x_0 ⊕ δ) ≠ y_0,

where δ = p_1 ⊕ · · · ⊕ p_l denotes a series of substitutions and ‖δ‖_0 counts the number of substituted words. In contrast, our second-order attacks fix δ = p and search for the vulnerable example x̃_0.

Proposed Second-Order Attacks
Second-order attacks study the prediction difference caused by applying p. For notational convenience, we define the prediction difference

F(x; p) := f(x ⊕ p) − f(x).    (2)

Taking Fig. 1 as an example, the prediction difference is nonzero on the yellow example x̃_0 since the patch p = (film, movie) alters the prediction, whereas it is zero on the blue x_0.

Given an input sentence x_0, we want to find patch words p and a vulnerable example x̃_0 such that f(x̃_0 ⊕ p) ≠ f(x̃_0). Following Alzantot et al. (2018), we choose p from a predefined list of counter-fitted synonyms (Mrkšić et al., 2016) that maximizes |f_soft(p^(2)) − f_soft(p^(1))|. Here f_soft(x) : X → [0, 1] denotes the probability output (e.g., after the softmax layer but before the final argmax), f_soft(p^(1)) and f_soft(p^(2)) denote the predictions for the single word, and we enumerate through all possible p for x_0. Let k be the neighborhood distance; then the attack is equivalent to solving

find x̃_0 ∈ Neighbor_k(x_0)  such that  F(x̃_0; p) ≠ 0.    (3)

Brute-force attack (SO-Enum). A naive approach to solving Eq. (3) is to enumerate through Neighbor_k(x_0). The enumeration finds the smallest perturbation, but is only applicable for small k (e.g., k ≤ 2) given the exponential complexity.

Beam-search attack (SO-Beam). The efficiency can be improved by utilizing the probability output, where we solve Eq. (3) by minimizing the cross-entropy loss with regard to x ∈ Neighbor_k(x_0):

L(x; p) = −log(1 − f_min(x; p)) − log f_max(x; p),    (4)

where f_min and f_max are the smaller and the larger output probability between f_soft(x) and f_soft(x ⊕ p), respectively. Minimizing Eq. (4) effectively leads to f_min → 0 and f_max → 1, and we use a beam search to find the best x. At each iteration, we construct sentences through Neighbor_1(x) and only keep the top β = 20 sentences with the smallest L(x; p). We run at most k iterations, and stop early if we find a vulnerable example. We provide the detailed implementation in Algorithm 2 and a flowchart in Fig. 3.

Figure 3: The attack flow for SO-Beam (Algorithm 2). Blue x_0 = "a deep and meaningful film." is the input sentence and yellow x̃_0 = "a short and moving film (movie)." is the constructed vulnerable example (the prediction can be altered by substituting film → movie). Green boxes in the middle show intermediate sentences (e.g., "a slow and moving film (movie)." and "a dramatic or meaningful film (movie)."), and f_soft(x) denotes the probability outputs for film and movie.
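Below is a minimal sketch of the SO-Beam objective (Eq. (4)) and beam search, assuming the `substitute` and `neighbor_k` sketches above and a model wrapper `f_soft(sentence)` that returns the probability of the positive class. The helper names and early-exit logic are illustrative, not the authors' released implementation.

```python
import math

def so_loss(x, p, f_soft):
    """Cross-entropy-style loss (Eq. 4): small when p flips the soft prediction on x."""
    a, b = f_soft(x), f_soft(substitute(x, p))
    f_min, f_max = min(a, b), max(a, b)
    return -math.log(1.0 - f_min + 1e-12) - math.log(f_max + 1e-12)

def so_beam(x0, p, f_soft, k=6, beam_width=20):
    """Beam search for a vulnerable example within Neighbor_k(x0)."""
    beam = [x0]
    for _ in range(k):
        candidates = set()
        for x in beam:
            # One-token perturbations that keep a single occurrence of the patch word.
            candidates |= {c for c in neighbor_k(x, 1) if c.split().count(p[0]) == 1}
        for x in candidates:
            if round(f_soft(x)) != round(f_soft(substitute(x, p))):
                return x                      # vulnerable example found; stop early
        beam = sorted(candidates, key=lambda x: so_loss(x, p, f_soft))[:beam_width]
    return None
```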

Experimental Results
In this section, we evaluate the second-order robustness of existing models and show the quality of our constructed vulnerable examples.

Setup
We follow the setup from the robust training literature (Jia et al., 2019; Xu et al., 2020) and experiment with both the base (non-robust) and robustly trained models. We train binary sentiment classifiers on the SST-2 dataset with bag-of-words (BoW), CNN, LSTM, and attention-based (Transformer) models.

Table 1: Example attack results where all attacks are successful. For SO-Enum and SO-Beam, the word in parentheses within the text is the patch word p^(2) and the parenthesized prediction is for that word.
Original, 70% Negative. Input Example: in its best moments , resembles a bad high school production of grease , without benefit of song .
Genetic, 56% Positive. Adversarial Example: in its best moment , recalling a naughty high school production of lubrication , unless benefit of song .
BAE, 56% Positive. Adversarial Example: in its best moments , resembles a great high school production of grease , without benefit of song .
SO-Enum and SO-Beam (ours), 60% Negative (67% Positive). Vulnerable Example: in its best moments , resembles a bad (unhealthy) high school production of musicals , without benefit of song .

Attack success rate (second-order). We also quantify second-order robustness through the attack success rate, which measures the ratio of test examples for which a vulnerable example can be found.
To evaluate the impact of neighborhood size, we experiment with two configurations: (1) For the small neighborhood (k = 2), we use SO-Enum, which finds the most similar vulnerable example. (2) For the large neighborhood (k = 6), SO-Enum is not applicable, and we use SO-Beam to find vulnerable examples. We consider the most challenging setup and use patch words p from the same set of counter-fitted synonyms as the robust models (they are provably robust to these synonyms on the test set). We also provide a random baseline to validate the effectiveness of minimizing Eq. (4) (Appendix A.1).

Quality metrics (perplexity and similarity). We also evaluate the quality of the constructed vulnerable examples, measuring their naturalness with language-model perplexity and their similarity to the original test examples.

Results
We experiment with the validation split (872 examples) on a single RTX 3090. The average running time per example (in seconds) on the base LSTM is 31.9 for Genetic, 1.1 for BAE, 7.0 for SO-Enum (k = 2), and 1.9 for SO-Beam (k = 6). We provide additional running time results in Appendix A.3. Table 1 provides an example of the attack results where all attacks are successful (additional examples in Appendix A.5). As shown, our second-order attacks find a vulnerable example by replacing grease → musicals, and the vulnerable example has different predictions for bad and unhealthy. Note that Genetic and BAE have different objectives from second-order attacks and focus on finding adversarial examples. Next we discuss the results from two perspectives.

Second-order robustness. We observe that existing robustly trained models are not second-order robust. As shown in Table 2, our attacks find vulnerable examples for 96.0%-99.8% of the test examples on both base and robustly trained models. Furthermore, applying existing attacks to the vulnerable examples constructed by our method leads to much smaller perturbations. As a reference, on the robustly trained CNN, the Genetic attack constructs adversarial examples by perturbing 2.7 words on average (starting from the input examples). However, if Genetic starts from our vulnerable examples, it only needs to perturb a single word (i.e., the patch words p) to alter the prediction. These results demonstrate the weakness of the models (even robustly trained ones) on inputs beyond the test set.

Human Evaluation
We perform a human evaluation on the examples constructed by SO-Beam. Specifically, we randomly sample constructed examples and ask human annotators to evaluate them; the evaluation protocol and results are described in Appendix A.2.

Evaluating Counterfactual Bias
In addition to evaluating second-order robustness, we further extend the double perturbation framework ( §2) to evaluate counterfactual biases by setting p to pairs of protected tokens. We show that our method can reveal the hidden model bias.

Counterfactual Bias
In contrast to second-order robustness, where we consider the model vulnerable as long as there exists one vulnerable example, counterfactual bias focuses on the expected prediction, which is the average prediction among all examples within the neighborhood. We consider a model biased if the expected predictions for protected groups are different (assuming the model is not intended to discriminate between these groups). For instance, a sentiment classifier is biased if the expected prediction for inputs containing woman is more positive (or negative) than for inputs containing man. Such biases are harmful as they may lead to unfair decisions based on protected attributes, for example in situations such as hiring and college admission.
We calculate Eq. (5), the soft prediction difference F_soft(x; p) := f_soft(x ⊕ p) − f_soft(x), in expectation over the neighborhood of the test set:

B_{p,k} := E_{x_0 ∈ X_test ∩ X_p} E_{x ∈ Neighbor_k(x_0)} [ F_soft(x; p) ].    (6)

The model is unbiased on p if B_{p,k} ≈ 0, whereas a positive or negative B_{p,k} indicates that the model shows a preference for or against p^(2), respectively. Fig. 4 illustrates the distribution of (x, x ⊕ p) predictions for both an unbiased model and a biased model.

The aforementioned neighborhood construction does not introduce additional bias. For instance, let x_0 be a sentence containing he; even though Neighbor_1(x_0) may contain many stereotyped sentences (e.g., containing tokens such as doctor and driving) that affect the distribution of f_soft(x), this does not bias Eq. (6), because we only measure the prediction difference caused by replacing he → she. The construction has no information about the model objective, so it would be difficult to bias f_soft(x) and f_soft(x ⊕ p) differently.
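A minimal sketch of estimating B_{p,k} as reconstructed above, assuming the `substitute` and `neighbor_k` sketches from §2 and a wrapper `f_soft(sentence)` returning the positive-class probability. The subsampling cap mirrors the random subsampling used for k = 3, but its value here is illustrative.

```python
import random

def counterfactual_bias(test_set, p, k, f_soft, max_samples=1000):
    """Average prediction shift caused by substituting p[0] -> p[1] over the neighborhood."""
    diffs = []
    for x0 in test_set:
        if x0.split().count(p[0]) != 1:        # keep sentences with one occurrence of p[0]
            continue
        neighborhood = [x for x in neighbor_k(x0, k) if x.split().count(p[0]) == 1]
        if len(neighborhood) > max_samples:    # subsample for large k
            neighborhood = random.sample(neighborhood, max_samples)
        for x in neighborhood:
            diffs.append(f_soft(substitute(x, p)) - f_soft(x))
    return sum(diffs) / len(diffs) if diffs else 0.0

# Example (hypothetical names): counterfactual_bias(sst2_val, ("he", "she"), k=3, f_soft=model_prob)
```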

Experimental Results
In this section, we use gender bias as a running example, and demonstrate the effectiveness of our method by revealing the hidden model bias. We provide additional results in Appendix A.4.

Setup
We evaluate counterfactual token bias on the SST-2 dataset with both the base and debiased models. We focus on binary gender bias and set p to pairs of gendered pronouns from Zhao et al. (2018a).

Base Model. We train a single-layer LSTM with pre-trained GloVe embeddings and a hidden size of 75 (from TextAttack, Morris et al. 2020). The model attains 82.9% accuracy, similar to the baseline performance reported in GLUE.

Debiased Model. Data augmentation with gender swapping has been shown to be effective in mitigating gender bias (Zhao et al., 2018a). We augment the training split by swapping all male entities with the corresponding female entities and vice versa (see the sketch below). We use the same setup as the base LSTM and attain 82.45% accuracy.

Figure 5: Our proposed B_{p,k} measured on X_filter. Here "original" is equivalent to k = 0, "perturbed" is equivalent to k = 3, and p is in the form (male, female).
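Below is a minimal sketch of the gender-swap augmentation described above. The word pairs shown are illustrative placeholders for the full list from Zhao et al. (2018a), and this simple token-level swap inherits the limitation on ambiguous tuples such as (him, his, her) discussed in Appendix A.4.

```python
# Illustrative pairs only; the paper uses the full gendered word list from Zhao et al. (2018a).
SWAP_PAIRS = [("he", "she"), ("his", "her"), ("man", "woman"), ("actor", "actress")]
SWAP = {a: b for a, b in SWAP_PAIRS}
SWAP.update({b: a for a, b in SWAP_PAIRS})

def gender_swap(sentence: str) -> str:
    """Swap male entities with the corresponding female entities and vice versa."""
    return " ".join(SWAP.get(tok, tok) for tok in sentence.split())

# Augment the training split with swapped copies (labels unchanged):
# train_augmented = train + [(gender_swap(x), y) for x, y in train]
```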
Metrics. We evaluate model bias through the proposed B_{p,k} for k = 0, . . . , 3. Here the bias for k = 0 is effectively measured on the original test set, and the bias for k ≥ 1 is measured on our constructed neighborhood. We randomly sample a subset of the constructed examples when k = 3 due to the exponential complexity.

Filtered test set. To investigate whether our method is able to reveal model bias that is hidden in the test set, we construct a filtered test set on which the bias cannot be observed directly. Let X_test be the original validation split; we construct

X_filter = {x ∈ X_test : |F_soft(x; p)| ≤ ε}

and empirically set ε = 0.005, i.e., we keep only examples on which the substitution p barely changes the prediction. We provide statistics in Table 5.
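A short sketch of the filtered test set under the reconstruction above, assuming the `substitute` helper and an `f_soft` wrapper as before; the filter rule is our reading of the construction, with ε = 0.005 as stated.

```python
EPS = 0.005

def build_x_filter(X_test, p, f_soft):
    """Keep only test sentences on which the substitution p barely changes the prediction."""
    return [x for x in X_test
            if x.split().count(p[0]) == 1
            and abs(f_soft(substitute(x, p)) - f_soft(x)) <= EPS]
```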

Results
Our method is able to reveal the hidden model bias on X_filter, which is not visible with naive measurements. In Fig. 5, the naive approach (k = 0) observes very small biases on most tokens (as constructed). In contrast, when evaluated by our double perturbation framework (k = 3), we are able to observe noticeable bias, where most p have a positive bias on the base model. This observed bias is in line with the measurements on the original X_test (Appendix A.4), indicating that we reveal the correct model bias. Furthermore, we observe mitigated biases in the debiased model, which demonstrates the effectiveness of data augmentation.
To demonstrate how our method reveals hidden bias, we conduct a case study with p = (actor, actress) and show the relationship between the bias B_{p,k} and the neighborhood distance k. We present the histograms of F_soft(x; p) in Fig. 6 and plot the corresponding B_{p,k} vs. k in the right-most panel. Surprisingly, for the base model, the bias is negative when k = 0 but becomes positive when k = 3. This is because the naive approach only has two test examples (Table 5), and thus the measurement is not robust. In contrast, our method is able to construct 141,780 similar natural sentences when k = 3 and shifts the distribution to the right (positive). As shown in the right-most panel, the bias is small when k = 1 and becomes more significant as k increases (larger neighborhood). As discussed in §4.1, the neighborhood construction does not introduce additional bias, and these results demonstrate the effectiveness of our method in revealing hidden model bias.

Related Work
First-order robustness evaluation.
A line of work has been proposed to study the vulnerability of natural language models through transformations such as character-level perturbations (Ebrahimi et al., 2018) and other perturbation strategies (Zhao et al., 2018b). These works focus on constructing adversarial examples from the test set that alter the prediction, whereas our methods focus on finding vulnerable examples beyond the test set whose prediction can be altered.

Robustness beyond the test set. Several works have studied model robustness beyond test sets, but mostly focused on computer vision tasks. Zhang et al. (2019) demonstrate that a robustly trained model can still be vulnerable to small perturbations if the input comes from a distribution only slightly different from a normal test set (e.g., images with slightly different contrasts). Hendrycks and Dietterich (2019) study more sources of common corruptions such as brightness, motion blur, and fog. Unlike in computer vision, where simple image transformations can be used, in our natural language setting generating a valid example beyond the test set is more challenging because language semantics and grammar must be maintained.

Conclusion
This work proposes the double perturbation framework to identify model weaknesses beyond the test dataset, and study a stronger notion of robustness and counterfactual bias. We hope that our work can stimulate the research on further improving the robustness and fairness of natural language models.
Intended use. One primary goal of NLP models is generalization to real-world inputs. However, existing test datasets and templates are often not comprehensive, and thus it is difficult to evaluate real-world performance (Recht et al., 2019; Ribeiro et al., 2020). Our work sheds light on quantifying performance for inputs beyond the test dataset and helps uncover model weaknesses prior to real-world deployment.

Misuse potential. Similar to other existing adversarial attack methods (Ebrahimi et al., 2018; Jin et al., 2019; Zhao et al., 2018b), our second-order attacks can be used to find vulnerable examples for an NLP system. Therefore, it is essential to study how to improve the robustness of NLP models against second-order attacks.

Limitations. While the core idea of the double perturbation framework is general, in §4 we consider only binary gender in the analysis of counterfactual fairness due to the restriction of the English corpus we used, which only has words associated with binary gender such as he/she, waiter/waitress, etc.

A.1 Random Baseline
To validate the effectiveness of minimizing Eq. (4), we also experiment with a second-order baseline that constructs vulnerable examples by randomly replacing up to 6 words. We use the same masked language model and threshold as SO-Beam so that they share a similar neighborhood. We perform the attack on the same models as in Table 2, and the attack success rates on robustly trained BoW, CNN, LSTM, and Transformers are 18.8%, 22.3%, 15.2%, and 25.1%, respectively. Despite being a second-order attack, the random baseline has low attack success rates, thus demonstrating the effectiveness of SO-Beam.
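A minimal sketch of this random baseline, assuming the `neighbor_k` and `substitute` sketches from §2 and an `f_soft` wrapper; the trial budget is an illustrative parameter, not a value from the paper.

```python
import random

def random_baseline(x0, p, f_soft, k=6, trials=100):
    """Randomly replace up to k words, then check whether p alters the prediction."""
    for _ in range(trials):
        x = x0
        for _ in range(random.randint(1, k)):
            candidates = [c for c in neighbor_k(x, 1) if c.split().count(p[0]) == 1]
            if not candidates:
                break
            x = random.choice(candidates)      # replace one word at random
        if round(f_soft(x)) != round(f_soft(substitute(x, p))):
            return x                           # vulnerable example found
    return None
```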

A.2 Human Evaluation
Semantic similarity after the synonym substitution. We first ask the annotators to predict the sentiment on a Likert scale of 1-5, and then map the prediction to three categories: negative, neutral, and positive. We consider two examples to have the same semantic meaning if and only if they are both positive or both negative.

A.3 Running Time
We experiment with the validation split on a single RTX 3090 and measure the average running time per example. As shown in Table 6, SO-Beam runs faster than SO-Enum since it utilizes the probability output. The running time may increase if the model has improved second-order robustness.

A.4 Additional Bias Results

We also evaluate pairs of tokens related to sexual orientation, e.g., (straight, gay) (Table 7). After looking into the training set, it turns out that straight to video is a common phrase used to criticize a film, and thus the classifier incorrectly correlates straight with negative sentiment. This also reveals a limitation of our method on polysemous words.

In Fig. 8, we measure the bias on X_test and observe a positive bias on most tokens for both k = 0 and k = 3, which indicates that the model tends to make more positive predictions for examples containing certain female pronouns than male pronouns. Notice that even though gender swapping mitigates the bias to some extent, it is still difficult to fully eliminate it. This is probably caused by tuples like (him, his, her), which cannot be swapped perfectly and require additional processing such as part-of-speech resolution (Zhao et al., 2018a).

To help evaluate the naturalness of our constructed examples used in §4, we provide sample sentences in Table 9 and Table 10. Bold words are the corresponding patch words p, taken from the predefined list of gendered pronouns.

A.5 Additional Attack Examples

Table 11 shows additional attack results from SO-Beam on the base LSTM, and Table 12 shows additional attack results from SO-Beam on the robust CNN. Bold words are the corresponding patch words p, taken from the predefined list of counter-fitted synonyms.

Sample neighborhood sentences for the patch words p = (actor, actress). Each entry lists the predictions for actor (actress), followed by the text.

Original
  95% Negative (94% Negative)  it 's hampered by a lifetime-channel kind of plot and a lead actor (actress) who is out of their depth .

Distance k = 1
  97% Negative (97% Negative)  it 's hampered by a lifetime-channel kind of plot and lone lead actor (actress) who is out of their depth .
  56% Negative (55% Positive)  it 's hampered by a lifetime-channel kind of plot and a lead actor (actress) who is out of creative depth .
  89% Negative (84% Negative)  it 's hampered by a lifetime-channel kind of plot and a lead actor (actress) who talks out of their depth .
  98% Negative (98% Negative)  it 's hampered by a lifetime-channel kind of plot and a lead actor (actress) who is out of production depth .
  96% Negative (96% Negative)  it 's hampered by a lifetime-channel kind of plot and a lead actor (actress) that is out of their depth .

Distance k = 2
  88% Negative (87% Negative)  it 's hampered by a lifetime-channel cast of stars and a lead actor (actress) who is out of their depth .
  96% Negative (95% Negative)  it 's hampered by a simple set of plot and a lead actor (actress) who is out of their depth .
  54% Negative (54% Negative)  it 's framed about a lifetime-channel kind of plot and a lead actor (actress) who is out of their depth .
  90% Negative (88% Negative)  it 's hampered by a lifetime-channel mix between plot and a lead actor (actress) who is out of their depth .
  78% Negative (68% Negative)  it 's hampered by a lifetime-channel kind of plot and a lead actor (actress) who storms out of their mind .

Distance k = 3
  52% Positive (64% Positive)  it 's characterized by a lifetime-channel combination comedy plot and a lead actor (actress) who is out of their depth .
  93% Negative (93% Negative)  it 's hampered by a lifetime-channel kind of star and a lead actor (actress) who falls out of their depth .
  58% Negative (57% Negative)  it 's hampered by a tough kind of singer and a lead actor (actress) who is out of their teens .
  70% Negative (52% Negative)  it 's hampered with a lifetime-channel kind of plot and a lead actor (actress) who operates regardless of their depth .
  58% Negative (53% Positive)  it 's hampered with a lifetime-channel cast of plot and a lead actor (actress) who is out of creative depth .

Additional attack examples constructed by SO-Beam. Each pair lists the predictions for the first (second) patch word, followed by the text; words in parentheses within the text are the patch words p.

Original, 54% Positive (69% Positive): for the most part , director anne-sophie birot 's first feature is a sensitive , overly (extraordinarily) well-acted drama .
Vulnerable, 53% Negative (62% Positive): for the most part , director anne-sophie benoit 's first feature is a sensitive , overly (extraordinarily) well-acted drama .

Original, 66% Positive (72% Positive): mr. tsai is a very original painter (artist) in his medium , and what time is it there ?
Vulnerable, 52% Negative (55% Positive): mr. tsai is a very original painter (artist) in his medium , and what time was it there ?

Original, 80% Positive (64% Positive): sade is an engaging (engage) look at the controversial eponymous and fiercely atheistic hero .
Vulnerable, 53% Positive (66% Negative): sade is an engaging (engage) look at the controversial eponymous or fiercely atheistic hero .

Original, 50% Negative (57% Negative): so devoid of any kind of comprehensible (intelligible) story that it makes films like xxx and collateral damage seem like thoughtful treatises
Vulnerable, 53% Positive (54% Negative): so devoid of any kind of comprehensible (intelligible) story that it makes films like xxx and collateral 2 seem like thoughtful treatises

Original, 90% Positive (87% Positive): a tender , heartfelt (deepest) family drama .
Vulnerable, 60% Positive (61% Negative): a somber , heartfelt (deepest) funeral drama .

Original, 57% Positive (69% Positive): ... a hollow joke (giggle) told by a cinematic gymnast having too much fun embellishing the misanthropic tale to actually engage it .
Vulnerable, 56% Negative (56% Positive): ... a hollow joke (giggle) told by a cinematic gymnast having too much fun embellishing the misanthropic tale cannot actually engage it .

Original, 73% Negative (56% Negative): the cold (colder) turkey would 've been a far better title .
Vulnerable, 61% Negative (62% Positive): the cold (colder) turkey might 've been a far better title .

Original, 70% Negative (65% Negative): it 's just disappointingly superficial -a movie that has all the elements necessary to be a fascinating , involving character study , but never does more than scratch the shallow (surface) .
Vulnerable, 52% Negative (55% Positive): it 's just disappointingly short -a movie that has all the elements necessary to be a fascinating , involving character study , but never does more than scratch the shallow (surface) .

Original, 79% Negative (72% Negative): schaeffer has to find some hook on which to hang his persistently useless movies , and it might as well be the resuscitation (revival) of the middleaged character .
Vulnerable, 57% Negative (57% Positive): schaeffer has to find some hook on which to hang his persistently entertaining movies , and it might as well be the resuscitation (revival) of the middleaged character .

Original, 64% Positive (58% Positive): the primitive force of this film seems to bubble up from the vast collective memory of the combatants (militants) .
Vulnerable, 52% Positive (53% Negative): the primitive force of this film seems to bubble down from the vast collective memory of the combatants (militants) .

Original, 64% Positive (74% Positive): on this troublesome (tricky) topic , tadpole is very much a step in the right direction , with its blend of frankness , civility and compassion .
Vulnerable, 55% Negative (56% Positive): on this troublesome (tricky) topic , tadpole is very much a step in the right direction , losing its blend of frankness , civility and compassion .

Original, 74% Positive (60% Positive): if you 're hard (laborious) up for raunchy college humor , this is your ticket right here .
Vulnerable, 60% Positive (57% Negative): if you 're hard (laborious) up for raunchy college humor , this is your ticket holder here .