Assessing Cognitive Linguistic Influences in the Assignment of Blame

Lab studies in cognition and the psychology of morality have proposed some thematic and linguistic factors that influence moral reasoning. This paper assesses how well the findings of these studies generalize to a large corpus of over 22,000 descriptions of fraught situations posted to a dedicated forum. At this social-media site, users judge whether or not an author is in the wrong with respect to the event that the author described. We find that, consistent with lab studies, there are statistically significant differences in uses of first-person passive voice, as well as first-person agents and patients, between descriptions of situations that receive different blame judgments. These features also aid performance in the task of predicting the eventual collective verdicts.


Introduction
Dyadic morality theory proposes that the harm one party causes another is an important component in how other people form judgments of the two parties as acting morally or not. Under this framework, perpetrators (agents) are perceived as blameworthy, whereas victims (patients) are not (Gray and Wegner, 2009; Schein et al., 2015). This effect appears to transfer to how active (agentive) a party is described to be, even if the activity was in the past, a phenomenon described by Gray and Wegner's (2011) paper titled "To Escape Blame, Don't be a Hero - Be a Victim".
The online forum https://reddit.com/r/AmItheAsshole collects first-person descriptions of (purportedly) real-life situations, together with commentary from other users as to who is blameworthy in the situation described; two examples are shown in Figure 1. (Additional examples may be found in Appendix A.) This data allows us to evaluate findings from dyadic morality theory on a corpus involving over 22,000 events and 685,000 passed judgments.
The research questions we address with this data in this paper include: (1) Do authors refer to themselves in passive voice more often in descriptions of situations where they are judged to be morally incorrect?
(2) How does an author's framing of themselves as an "agent" or "patient" in describing a moral situation affect the judgments they receive?
The first question is motivated by Bohner (2002), who found that using passive voice, by placing someone who was actually a victim in subject position (e.g., "X was threatened by Y"), causes the victim to seem more responsible for the event. (See also Niemi and Young (2016) on the effect of syntactic-subject position for perpetrator vs. victim descriptions.)
Importantly, our two questions together separate passive voice from agentiveness. We find that while the agentive aspect of dyadic morality theory is upheld in our data, the passive-voice prediction is not borne out empirically. We also incorporate both theories as features in a verdict prediction task.

Data
The subreddit from which we draw our data is self-described as follows: "A catharsis for the frustrated moral philosopher in all of us, and a place to finally find out if you were wrong in an argument that's been bothering you. Tell us about any non-violent conflict you have experienced; give us both sides of the story, and find out if you're right, or you're the [jerk]."
It has served as the basis of prior computational analysis of moral judgment by Botzer et al. (2021) and Lourie et al. (2021). (The dataset of Emelin et al. (2020) would have been an interesting alternative corpus to work with. It also draws some of its situations from the same subreddit.) Since the SCRUPLES dataset (Lourie et al., 2021), also based on the aforementioned subreddit, does not include corresponding full comments, which we wanted to have as an additional source of analysis, we scraped the subreddit ourselves. Our dataset (henceforth AITA) includes posts from the same timeframe as SCRUPLES: November 2018 to April 2019.
The winning verdict of each post is determined, according to the subreddit's rules, by the verdict espoused by the top-voted comment 18 hours after submission. We aim to only include posts with meaningful content, so we discard posts with fewer than 20 comments and fewer than 6 words in the body, as manual appraisal revealed that these were often uninformative (e.g., body is "As described in the title").
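The filtering and verdict-assignment rules above can be sketched as follows. This is a minimal illustration rather than our scraping code: the post dictionary layout (`body`, `comments`, `text`, `score`) is hypothetical, comment scores are assumed to be a snapshot taken 18 hours after submission, and we assume a post is dropped if it misses either threshold.

```python
import re
from typing import Optional

MIN_COMMENTS = 20    # threshold stated in the text
MIN_BODY_WORDS = 6   # threshold stated in the text
VERDICT_RE = re.compile(r"\b(YTA|NTA)\b")  # only the two verdicts we study


def keep_post(post: dict) -> bool:
    """Assumed reading: keep a post only if it meets both thresholds."""
    enough_comments = len(post["comments"]) >= MIN_COMMENTS
    enough_words = len(post["body"].split()) >= MIN_BODY_WORDS
    return enough_comments and enough_words


def winning_verdict(post: dict) -> Optional[str]:
    """Return the YTA/NTA tag of the top-voted comment, or None if absent."""
    if not post["comments"]:
        return None
    top = max(post["comments"], key=lambda c: c["score"])
    match = VERDICT_RE.search(top["text"])
    return match.group(1) if match else None
```

Posts whose top comment carries neither tag return `None` and would be excluded from the binary setting.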
For simplicity, we only consider situations with the YTA (author in the wrong, other party in the right) and NTA (author in the right, other party in the wrong) verdicts, although other verdicts (such as "everyone is in the wrong" and "no one is in the wrong") are possible. We still use over 75% of the data since these are the most prevalent outcomes on the forum, and the theories we assess align with having binary outcomes (comparing victim vs. perpetrator responsibility). This selection results in 22,795 posts, fewer than the over 32,000 in SCRUPLES (Lourie et al., 2021). The corpus contains more NTA posts, which are longer in word length on average (see Table 1).

Methodology
Passive subject identification. To model the use of passive voice in moral situations, a dependency parser is used to match spans of passive subjects in sentences. We use spaCy's Matcher object to extract tokens tagged nsubjpass. Cases where the extracted passive subject is in first person (1P) are also tracked, as an indication of the author being referred to passively. Some examples include:
• 1P passive subject: I was asked to be a bridesmaid and then she changed her mind last minute and I was removed from the bridal party in favor of one of her husbands cousins.
• Other passive subject: She obliged but she was pissed off the rest of the night.
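The counting step can be illustrated on an already-parsed sentence. The sketch below is a simplification and not the spaCy Matcher pipeline itself: it assumes the parse is available as (token, dependency label) pairs, the example parse is written by hand, and the set of first-person subject forms is an assumption.

```python
from typing import Iterable, Tuple

# Assumed first-person subject forms; not taken from the original pipeline.
FIRST_PERSON_SUBJECTS = {"i", "we"}


def passive_subject_counts(tokens: Iterable[Tuple[str, str]]) -> Tuple[int, int]:
    """Return (all passive subjects, first-person passive subjects)."""
    total = first_person = 0
    for text, dep in tokens:
        if dep == "nsubjpass":          # passive nominal subject label
            total += 1
            if text.lower() in FIRST_PERSON_SUBJECTS:
                first_person += 1
    return total, first_person


# Hand-labelled parse of "I was removed from the bridal party" (illustrative only).
example = [("I", "nsubjpass"), ("was", "auxpass"), ("removed", "ROOT"),
           ("from", "prep"), ("the", "det"), ("bridal", "amod"), ("party", "pobj")]
print(passive_subject_counts(example))
```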
For manual evaluation, we randomly selected 500 posts, containing a total of 675 uses of passive voice. The tagger achieved 0.984 precision on these posts. Among the 199 first-person passive subjects tagged, the precision achieved was 0.971.
Because Niemi and Young (2016) and Bohner (2002) find that passive voice is associated with greater perception of victims' causal responsibility, we hypothesize that situations with the YTA verdict may have higher rates of 1P passive subject usage.
Thematic role identification. To approximate moral agents vs. patients, Semantic Role Labelling (SRL) is used to extract agents and patients. Semantic, or thematic, roles express the roles taken by arguments of a predicate in an event; an agent is the volitional causer of the event, while the theme or patient is most affected by the event (Jurafsky and Martin, 2019). The AllenNLP BERT-based Semantic Role Labeller (Gardner et al., 2017; Shi and Lin, 2019) is employed to extract spans that are tagged ARG0 for agents and ARG1 for patients. We also tag uses of 1P agents and patients. Here are two examples:
• 1P agent: I don't want my fiance to take care these freeloaders anymore.
• 1P patient: He called me names, threatened divorce, and told me he's a saint for staying married to me.
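To make the role counting concrete, the following sketch tallies ARG0 and ARG1 spans from BIO-tagged verb frames. It assumes the labeller's output is a dictionary with a `words` list and one `tags` list per verb frame; the sample frame is written by hand, and the rule that a span counts as first-person when it contains one of a small set of pronouns is our assumption for illustration.

```python
from typing import Dict, List

# Assumed first-person pronoun set; a span containing any of these counts as 1P.
FIRST_PERSON = {"i", "me", "we", "us"}


def spans_for_role(words: List[str], tags: List[str], role: str) -> List[List[str]]:
    """Collect the token spans labelled with `role` from BIO tags."""
    spans, current = [], []
    for word, tag in zip(words, tags):
        if tag == f"B-{role}":
            if current:
                spans.append(current)
            current = [word]
        elif tag == f"I-{role}" and current:
            current.append(word)
        else:
            if current:
                spans.append(current)
            current = []
    if current:
        spans.append(current)
    return spans


def count_roles(srl_output: dict) -> Dict[str, int]:
    """Count agents (ARG0), patients (ARG1), and their first-person subsets."""
    counts = {"ARG0": 0, "ARG1": 0, "1P-ARG0": 0, "1P-ARG1": 0}
    for frame in srl_output["verbs"]:
        for role in ("ARG0", "ARG1"):
            for span in spans_for_role(srl_output["words"], frame["tags"], role):
                counts[role] += 1
                if any(w.lower() in FIRST_PERSON for w in span):
                    counts[f"1P-{role}"] += 1
    return counts


# Hand-written frame for "He called me names" (illustrative only).
sample = {"words": ["He", "called", "me", "names"],
          "verbs": [{"verb": "called", "tags": ["B-ARG0", "B-V", "B-ARG1", "B-ARG2"]}]}
print(count_roles(sample))
```

Under this containment rule, a span such as "Everyone we know" would be counted as first-person even though it describes another party.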
As a sanity check, we manually evaluated a subset of 579 verb frames, corresponding to 15 posts, identified by the SRL tagger. The tagger achieved a precision of 0.934 on all verb frames. The precision on the 193 verb frames in this subset that contained a first-person ARG0 or ARG1 was 0.891. Examples where the tagger failed include sentence fragments (e.g. "Made my MMA debut today.") and use of first-person pronouns to describe other parties (e.g. "Everyone we know").
Gray and Wegner (2011) concluded that "it pays to be a [patient] when trying to escape blame. [Agents],... depending on the situation, may actually earn increased blame." Thus, we hypothesize that NTA may be associated with higher 1P-patient usage and YTA with higher 1P-agent usage.

Statistical Analysis
Due to the post length discrepancy between verdicts, we attempt to control for length in the analysis by assessing significance at the sentence level. While NTA posts average approximately 50 words more than YTA posts, sentences from NTA posts average only 0.5 words more than YTA sentences (17.0 vs. 16.6 words respectively).
We assess statistical significance as follows. We use a simple binomial test: let r be the rate of the given feature of interest (say, 1P-passive voice) over the entire collection of posts. We then compute the probability of the observed number of occurrences of the feature in just the YTA posts according to the r-induced binomial distribution (i.e., under the null hypothesis that there is no difference between the YTA posts and the body of posts overall). Similarly, we compute this probability for just the NTA posts.

Passive Subject Identification
We find that NTA situations have a higher rate of 1P passive subject usage than YTA situations, and that the deviation of both the rate in the YTA posts and in the NTA posts from the overall data is statistically significant. As shown in Table 2, 45.8% of NTA posts' passive voice uses are 1P, while 37.4% of YTA posts' passive voice uses are 1P.
The rate difference across verdicts is significant, with NTA posts having a higher 1P-passive rate (see Table 2). This could account for the 0.5-word-longer sentence average of NTA posts, since, for example, "I hit John" is shorter than its passive counterpart, "John was hit by me." This contradicts our hypothesis, as we expected higher 1P-passive rates for YTA posts.
One possible explanation for this differing result is that the cognitive researchers had better control over the narrative structure and content of their situations, and over the participants who provided judgments. On the other hand, it is also possible that the forum setting is, at least in certain respects, more natural (and definitely larger-scale) than the lab setting in which the original experiments took place.

Thematic Role Identification
The NTA posts use more agents and patients by raw count and also have more verbs per post, since they are generally longer than YTA posts (see Table 3). When we examine proportions of uses, we find that the NTA posts have a higher rate of 1P-patient usage, while YTA posts have a higher rate of 1P-agent usage. While the verdicts do not differ significantly in overall agent and patient usage, there are significant differences in rates of 1P (see Table 4). The rate of 1P-patient usage in NTA posts is significantly higher than that of YTA posts (p < 0.005), while the rate of 1P-agent usage in YTA posts is significantly higher than that of NTA posts (p < 0.001). These results seem to align with our hypothesis based on Gray and Wegner's (2011) findings.

Verdict prediction task
[Figure caption] The verbs identified by the tagger are highlighted in yellow, and the 1P ARG0 is highlighted in cyan.
[Feature-set note] Note that the +Passive features are very sparse. First-person "me" and "us" were not added as +Passive features, as their inclusion yielded about 1% worse performance (likely since they added additional noise).
In the previous section, we examined statistical correlations between features of interest from the previous literature and the verdicts presented in our data.
In this section, we turn to prediction as another way to examine the magnitude of potential linkages between these features and judgments of blame. In particular, we see how incorporating this quite small set of features compares against a baseline classifier that has access to many more (lexical-based) features, but where these features are not explicitly cognitively motivated. Specifically, to analyze the significance of passive voice and thematic roles as features in making moral judgments, we model the task of predicting the verdict of a situation as binary classification (YTA or NTA). We compare the performance of a linear and a non-linear model.
We stress that we are not striving to build the most accurate judgment predictor for moral-scenario descriptions, nor arguing for the utility or importance of such a classification task. Rather, we are using prediction as a further mechanism for answering the research questions we delineated in the introduction to this paper. We do not use BERT as a classifier since it is pre-trained, possibly containing encoded biases, and is not as interpretable as simpler models. For an ablation study, we have four feature sets: Length, Length+Passive, Length+SRL, and Length+Passive+SRL. We assess these feature sets against 43,110 lexical-based features from a TF-IDF transform of lowercased unigrams and bigrams with 0.1% minimum document frequency.
The configurations for the linear and non-linear models are described below. The AITA data is split 60/20/20 for the train/val/test sets, after random shuffling. Both models are trained on this same split of data.
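A 60/20/20 split after random shuffling can be sketched as below. The seed value and the integer rounding of the split boundaries are assumptions for illustration; the text does not specify either.

```python
import random
from typing import List, Sequence, Tuple


def split_60_20_20(items: Sequence, seed: int = 0) -> Tuple[List, List, List]:
    """Shuffle a copy of `items` and cut it into train/val/test portions."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)   # seeded for reproducibility (assumed)
    n = len(shuffled)
    n_train = n * 6 // 10                   # integer arithmetic avoids float rounding
    n_val = n * 2 // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]       # remainder goes to the test set
    return train, val, test


train, val, test = split_60_20_20(range(22795))
print(len(train), len(val), len(test))
```

Reusing the same seed for both models is one simple way to guarantee they are trained on the same split.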

Linear Model
We opt for a simple model to begin with to avoid overfitting on the dataset and for purposes of interpretability. For a linear model, we use the scikit-learn logistic regression model (LR) (Pedregosa et al., 2011). Hyperparameters for the logistic regression model include setting random state to 0, choosing "liblinear" as the solver, and setting the class weights to "balanced" to account for the label imbalance.
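The stated configuration corresponds to roughly the following scikit-learn setup. The synthetic eight-column feature matrix and labels are placeholders for illustration only and do not reflect our data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def make_linear_model() -> LogisticRegression:
    """Logistic regression with the hyperparameters listed in the text."""
    return LogisticRegression(
        random_state=0,
        solver="liblinear",
        class_weight="balanced",   # reweights classes to offset label imbalance
    )


# Placeholder data: 200 rows, 8 features, imbalanced binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0.5).astype(int)

model = make_linear_model().fit(X, y)
print(model.coef_.shape)   # one weight per feature, which aids interpretability
```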

Non-Linear Model
We incorporate a non-linear model, as we observed that our feature count distributions were weakly bi-modal even after grouping instances under the NTA/YTA labels (see Figure 3). We use the scikit-learn random forest model (RF) (Pedregosa et al., 2011). Hyperparameters for the random forest model include setting class weights to "balanced", "sqrt" for the maximum features, and 100 for number of estimators. Through tuning over the range [5,15], we found that setting the maximum depth to 7 prevented overfitting on the training data.
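A sketch of this configuration and of a depth search over [5, 15] follows. The selection criterion (macro F1 on the validation set) and the fixed random seed are assumptions; the text only states that a maximum depth of 7 prevented overfitting.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score


def make_forest(max_depth: int, seed: int = 0) -> RandomForestClassifier:
    """Random forest with the hyperparameters listed in the text."""
    return RandomForestClassifier(
        n_estimators=100,
        max_features="sqrt",
        class_weight="balanced",
        max_depth=max_depth,
        random_state=seed,         # assumed; not stated in the text
    )


def pick_depth(X_train, y_train, X_val, y_val, depths=range(5, 16)):
    """Return (best depth, {depth: validation macro F1}) over the given range."""
    scores = {}
    for depth in depths:
        model = make_forest(depth).fit(X_train, y_train)
        scores[depth] = f1_score(y_val, model.predict(X_val), average="macro")
    best = max(scores, key=scores.get)
    return best, scores
```

Comparing training and validation scores at each depth would show more directly where overfitting sets in.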

Task Results
To give the imbalanced labels equal importance, we evaluated macro-average scores. Weighted average scores were usually around 1% higher than the macro-average scores. Overall, the non-linear model achieves higher F1 scores for each of our feature sets, though the linear model does better with TF-IDF features (see Tables 5 and 6).
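The difference between the two averaging schemes can be seen in a small worked example. The labels below are invented for illustration; the functions are a plain-Python sketch of the standard definitions.

```python
from collections import Counter
from typing import Sequence, Tuple


def class_f1(y_true: Sequence[str], y_pred: Sequence[str], label: str) -> float:
    """F1 score for one class, treating `label` as the positive class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_and_weighted_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> Tuple[float, float]:
    """Macro F1 weights classes equally; weighted F1 weights them by support."""
    support = Counter(y_true)
    scores = {label: class_f1(y_true, y_pred, label) for label in support}
    macro = sum(scores.values()) / len(scores)
    weighted = sum(scores[l] * support[l] for l in support) / len(y_true)
    return macro, weighted


# Invented labels with an NTA majority, mirroring the imbalance in the corpus.
y_true = ["NTA"] * 6 + ["YTA"] * 2
y_pred = ["NTA"] * 6 + ["NTA", "YTA"]
print(macro_and_weighted_f1(y_true, y_pred))
```

When the majority class is predicted better than the minority class, as in this toy case, the weighted average exceeds the macro average.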

Linear Model Results
Compared to the random forest, the linear model achieves better performance with TF-IDF features (0.62 vs. the random forest's 0.58 F1 score). Length+SRL has the best performance of our feature sets, with 0.56 precision and recall and 0.54 F1 score (see Table 5). The distinction in performance across feature sets is less clear than with the non-linear model, suggesting that the logistic regression model is not able to learn as well from these particular features.
From the ROC curves, we see that Length+SRL shows a little improvement over Length alone at higher thresholds, but does around equally at lower thresholds (see Figure 4a). The performance gap between TF-IDF features and our features is greater with the linear model.
We also see from the confusion matrix in Figure 4a that the best model version tends to predict YTA. Depending on the desired use case (and recalling that we are not necessarily promoting judgment prediction as a deployed application), it may be better to err on the side of predicting one verdict or the other. If the priority is to catch all possible occurrences of the author being judged to be in the wrong, this model would be better suited than the non-linear model. However, this model would also yield more false accusations, which could be more undesirable.

Non-Linear Model Results
As with the linear model, Length+SRL does best overall, with 0.56 for precision, recall, and F1 score (see Table 6). Length+Passive+SRL performs similarly.
From the ROC curves, we see that Length+SRL shows some improvement over Length alone (see Figure 4b). With these feature sets, we achieve performance close to that of a model trained with TF-IDF, with far fewer features: a mere 8 vs. 43,110. In addition, TF-IDF may overfit to topics (e.g., weddings), whereas our features are easier to transfer across domains.
From the confusion matrix in Figure 4b, we notice that even with balanced class labels, the best model still slightly favors predicting NTA.

Discussion
Despite the significant difference in first-person passive voice usage between verdicts, the feature set of Length+Passive yields slightly lower performance than the Length baseline for both models. This could be due to not having enough instances of passive voice, as each post has on average 1.39 counts of passive voice, of which 30.5% are first-person. Regex searches confirmed that the dependency parser did not simply have poor recall, though the methods for passive voice extraction are not exact. Thus, the passive features may be acting as noise.
Length+SRL builds off of more SRL instances per post, so these features provide less noisy information. This feature set's performance beats that of the Length baseline for both models, suggesting that SRL features do play a role in making moral judgments. The SRL features do not store lexical information, which helps remove the influence of the content of the posts. Length+Passive+SRL performance likely suffers from the additional passive features' noise.
A notable difference is the non-linear model's tendency to favor NTA and the linear model's preference for YTA. A possible explanation for this is that the features corresponding to YTA situations are more linearly separable than those corresponding to NTA situations.
Comparing scores for the Length baseline, we see that the random forest has a 3.8% improvement in F1 score over logistic regression. This may suggest that post length is not a linear feature, which would account for nuances such as long YTA and short NTA posts (see Figure 1b for an example of a short NTA post).
Caveats. We are certainly not saying that blameworthiness can be reduced to use of first-person descriptors. There are a multitude of features and factors at play, and there may be alternative parameters to consider for the task.
Even if we restrict attention to linguistic signals, there are quite a few confounds to point out. As just one example: it is possible that authors purposefully manipulate their use of first-person pronouns to appear less guilty. Another possibility to consider: there may be correlations between whether an author believes they are guilty and how they describe a situation, so that commenters are not picking up on the actual culpability in the described scenario so much as the author's self-blame.
Also, we can look beyond linguistic factors. For example, when deciding whether to "upvote" a particular judgment comment, voters may be affected by the (apparent) identity of the commenter (or, for that matter, the original post author) and the content of other comments. We have not accounted for such factors in our study.
Figure 4: The ROC curves for the verdict prediction task, for the feature sets described in the key, and the confusion matrices for the verdict prediction task results of the Length+SRL feature set. Panel (b): RF model.
We must also keep in mind that users of the forum constitute a particular sample of people that is likely not representative of many populations of interest.

Conclusion and Future Work
We introduce findings from moral cognitive science and psychology and assess their application to a forum of user-generated ethical situations. Statistical tests confirm that there are significant differences in usage of first-person passive voice, along with first-person agents and patients, among situations of different verdicts. Incorporating these differences as features in a verdict prediction task supports the linkage between first-person agents and patients and assigned blame, though passive voice features appear too sparse to yield meaningful results.
From this study, we conclude that the manner in which a situation is described does appear to influence how blame is assigned. In the forum we work with, people seem to be judged by the way they present themselves, not just by their content, which aligns with previous cognitive science studies. Future endeavors in ethical AI could incorporate such theories to promote interpretability of models that produce moral decisions.
There are several areas of this project that could be refined and pursued further. We can repeat these experiments with the other verdicts, incorporating situations where all parties or no parties are blamed. We can use stricter length control than the sentence-level comparison, since the average sentence length still differs between posts of different verdicts. We should also incorporate validation that the SRL methodology effectively extracts the moral agents and patients we are trying to analyze. Another direction we would like to pursue, and one also mentioned by a reviewer, is to group situations by topic to try to control for other confounds in the moral situations. Finally, we hope to be able to incorporate the range of votes from the comments accompanying each post to allow for more nuanced verdict prediction, as done with SCRUPLES in Lourie et al. (2021).

A Additional Examples
Warning: Some content in these examples may be offensive or upsetting. Figure 5 shows an example where there was relative disagreement about the guilty party. Figure 6 shows an example where there was general consensus about the verdict.
Figure 5: A situation where there was noticeable disagreement among comments. Depicted is the top-rated comment and one additional, contrary opinion.
Figure 6: A situation wherein other comments were in general agreement with the final verdict (i.e., that of the top-rated comment). We show only one additional comment due to space constraints.