Modeling the Severity of Complaints in Social Media

The speech act of complaining is used by humans to communicate a negative mismatch between reality and expectations, as a reaction to an unfavorable situation. Linguistic theory of pragmatics categorizes complaints into various severity levels based on the face-threat that the complainer is willing to undertake. This is particularly useful for understanding the intent of complainers and how humans develop suitable apology strategies. In this paper, we study the severity level of complaints for the first time in computational linguistics. To facilitate this, we enrich a publicly available data set of complaints with four severity categories and train different transformer-based networks combined with linguistic information, achieving up to 55.7 macro F1. We also jointly model binary complaint classification and complaint severity in a multi-task setting, achieving new state-of-the-art results on binary complaint detection, reaching up to 88.2 macro F1. Finally, we present a qualitative analysis of the behavior of our models in predicting complaint severity levels.


Introduction
Complaining is a speech act that usually conveys negative emotions triggered by a discrepancy between reality and expectations towards an entity or event (Olshtain and Weinbach, 1985). Complaints play an important role in human communication for expressing dissatisfaction. The expression of complaints varies from person to person, depending on the complainer's personality and the specific situation (Vásquez, 2011).
In pragmatics, complaints have been classified into various levels of severity according to their emotional intensity, the amount of face-threat that the complainer is willing to undertake, and their purpose (Olshtain and Weinbach, 1985; Trosborg, 2011; Kakolaki and Shahrokhi, 2016). The purpose of a complaint might be to express dissatisfaction, to seek a remedy (e.g. ask for reparations), or both. Furthermore, a complaint can be categorized as implicit (i.e. without mentioning who is responsible) or explicit (i.e. accusing someone of doing something).
Recent work on modeling complaints in natural language processing (NLP) has focused on distinguishing complaints from non-complaints in social media (Preotiuc-Pietro et al., 2019; Jin and Aletras, 2020); however, there has been no previous study of more fine-grained complaint categories. Table 1 shows examples of social media posts expressing complaints grouped into four severity classes according to Trosborg (2011): (a) no explicit reproach; (b) disapproval; (c) accusation; and (d) blame.
Identifying and analyzing the severity of complaints is important for: (a) improving customer service by recognizing the level of dissatisfaction and understanding complainers' needs (Van Noort and Willemsen, 2012); (b) linguists to study the speech act of complaints in different levels of granularity on large scale (Tatsuki, 2000); and (c) developing downstream NLP applications such as automatic complaint response generation (Xu et al., 2017) or voting stance prediction (Tsakalidis et al., 2018).
In this paper, we present the first systematic study of complaint severity categories with computational methods. Our main contributions are as follows:
• Grounded in the linguistic theory of pragmatics (Trosborg, 2011), we enrich a publicly available data set (Preotiuc-Pietro et al., 2019) with four complaint severity levels;
• We introduce a new classification task for identifying the severity level of a complaint;
• We evaluate transformer-based classification models (Vaswani et al., 2017) combined with linguistic information on (a) complaint severity level classification; and (b) binary complaint detection in a multi-task setting, achieving new state-of-the-art results.

Complaint Analysis
Most existing studies on complaint classification in NLP have explored approaches to the complaint identification task (identifying complaints from non-complaints) in various domains, starting with feature-based machine learning models (Coussement and Van den Poel, 2008; Preotiuc-Pietro et al., 2019) and, more recently, deep learning methods (Jin and Aletras, 2020). Other studies have classified complaints into predefined categories (Forster and Entrup, 2017; Merson and Mary, 2017) or assigned them to responsible departments (Laksana and Purwarianti, 2014; Gunawan et al., 2018; Tjandra et al., 2015). Furthermore, other complaint-related categorizations are based on product hazards and risks (Bhat and Culotta, 2017), service failure (Jin et al., 2013) and escalation likelihood (Yang et al., 2019).

Emotion Detection
Most related to complaint severity is the detection of emotions and their intensity, which has been extensively studied in NLP (Danisman and Alpkocak, 2008; Volkova and Bachrach, 2016). More recently, Alejo et al. (2020) explored cross-lingual transfer approaches to predict emotion intensity on Twitter. Similarly, Akhtar et al. (2020) evaluated a series of feature-based machine learning models for both emotion and sentiment intensity prediction in social and news media.

Task & Data
We define complaint severity prediction as a multi-class classification task. Given a text snippet $T$, defined as a sequence of tokens $T = \{t_1, ..., t_n\}$, the aim is to classify $T$ as one of four predefined severity labels. We use an existing complaints data set developed by Preotiuc-Pietro et al. (2019), which consists of 1,235 complaints (35.8%) and 2,214 non-complaints (64.2%) in English. We opted to use this data set because it is publicly available, with annotated complaints collected from Twitter in nine general domains (i.e. Food, Apparel, Retail, Cars, Service, Software, Transport, Electronics and Other).

Complaint Severity Categories
For complaint severity annotation, we adopt the four categories defined by Trosborg (2011) because they are considered the 'standard' in the pragmatics literature (see examples in Table 1):
• No explicit reproach: there is no explicit mention of the cause and the complaint is not offensive;
• Disapproval: expresses explicit negative emotions such as dissatisfaction, annoyance, dislike and disapproval;
• Accusation: asserts that someone did something reprehensible;
• Blame: assumes the complainee is responsible for the undesirable result.
Note that the severity levels categorize complaints by type rather than by intensity, and the classes are disjoint according to Trosborg (2011). More specifically, 'No explicit reproach' is a suggestive strategy in which the complainee is usually not mentioned in the statement. 'Disapproval' only expresses negative sentiment or an unsatisfying state; the statement may imply that the complainer holds the complainee responsible but avoids saying so explicitly, which is the key property distinguishing 'Disapproval' from 'Accusation' and 'Blame'. The main difference between 'Accusation' and 'Blame' is that in the latter the complainer presupposes that the complainee is guilty of the offense.

Complaint Severity Annotation
Following the definitions above, each tweet was labeled by three annotators independently; in case of ties, the final decision was made by the authors through consensus. We recruited 35 native English-speaking annotators from the volunteers list of our institution. We received approval from the Ethics Committee of our institution, and annotators were provided with an introduction to the task including definitions and examples of each category. The inter-annotator agreement between the three original annotations for each tweet is Fleiss' Kappa k = 0.64 (Fleiss, 1971), computed by randomizing the order of the three annotations for each tweet three times and averaging, which indicates substantial agreement (Artstein and Poesio, 2008). Table 2 shows the distribution of tweets across classes: 435 tweets belong to 'No Explicit Reproach' (35.2%) and 378 to 'Disapproval' (30.6%), with the remainder split between 'Accusation' and 'Blame'.
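For illustration, a minimal sketch of computing Fleiss' Kappa over such three-way annotations with statsmodels (the toy label arrays are hypothetical):

```python
# Sketch: Fleiss' Kappa over three annotations per tweet.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# annotations: (n_tweets, 3) array of severity labels in {0, 1, 2, 3},
# one column per (order-randomized) annotator slot. Toy values only.
annotations = np.array([
    [0, 0, 1],
    [2, 2, 2],
    [3, 2, 3],
])

# aggregate_raters converts raw labels into an (n_items, n_categories)
# count table, the input format fleiss_kappa expects.
table, _ = aggregate_raters(annotations)
print(f"Fleiss' Kappa: {fleiss_kappa(table, method='fleiss'):.2f}")
```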

Text Processing
Text is processed by lower-casing, and replacing all mentions of usernames and URLs with placeholder tokens. A Twitter-aware tokenizer, DLATK (Schwartz et al., 2017), is used for text tokenization to handle emoticons and hashtags in social media text.
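As an approximation of this pipeline, a minimal sketch using simple regexes (the paper uses the DLATK tokenizer; the regexes here are illustrative substitutes only):

```python
# Sketch: lower-case text and replace user mentions and URLs
# with placeholder tokens, approximating the described preprocessing.
import re

def preprocess(tweet: str) -> str:
    tweet = tweet.lower()
    tweet = re.sub(r"@\w+", "<USER>", tweet)          # replace user mentions
    tweet = re.sub(r"https?://\S+", "<URL>", tweet)   # replace URLs
    return tweet

print(preprocess("@Delta why is my flight delayed AGAIN? https://t.co/x1"))
# -> "<USER> why is my flight delayed again? <URL>"
```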

Predictive Models
Since complaint severity prediction is a new task, we first evaluate a majority-class baseline as well as three stronger baselines: (1) logistic regression with bag-of-words; (2) a bidirectional recurrent neural network trained from scratch; and (3) fine-tuning a pre-trained transformer-based model. Furthermore, we combine linguistic information (i.e. emotion and topic information) with a transformer-based model, similar to the method proposed by Jin and Aletras (2020) in the context of binary complaint classification.

Baselines
Majority Class Our first baseline labels all tweets with the majority class and computes scores accordingly.
LR-BOW We use a linear baseline, logistic regression with standard bag-of-words features (LR-BOW) and L2 regularization.
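A minimal scikit-learn sketch of this baseline (hyperparameters are defaults, not the paper's tuned values):

```python
# Sketch: bag-of-words counts fed to an L2-regularized logistic regression.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

lr_bow = make_pipeline(
    CountVectorizer(),                                # standard bag-of-words
    LogisticRegression(penalty="l2", max_iter=1000),  # L2 regularization
)
# tweets: list[str]; labels: list[int] over the four severity classes
# lr_bow.fit(tweets, labels)
```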
BiGRU-Att We also use a neural baseline trained from scratch: a bidirectional Gated Recurrent Unit (GRU) network (Cho et al., 2014) with a self-attention mechanism (BiGRU-Att; Tian et al., 2018). Given a Twitter post $T$, each token $t_i$ is mapped to a GloVe embedding (Pennington et al., 2014). We then apply dropout to the output of the embedding layer and pass it to a bidirectional GRU with a self-attention layer. Finally, the contextualized token representations are passed to an output layer with a softmax activation function for multi-class classification.
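For illustration, a minimal PyTorch sketch of this architecture (the dimensions and dropout rate are our assumptions, not the paper's reported values):

```python
# Sketch: BiGRU encoder with additive self-attention pooling.
import torch
import torch.nn as nn

class BiGRUAtt(nn.Module):
    def __init__(self, vocab_size, emb_dim=200, hidden=128, n_classes=4):
        super().__init__()
        # Embeddings would be initialized from GloVe vectors in practice.
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.dropout = nn.Dropout(0.2)
        self.bigru = nn.GRU(emb_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.att = nn.Linear(2 * hidden, 1)       # self-attention scorer
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, token_ids):                  # (batch, seq_len)
        h, _ = self.bigru(self.dropout(self.embedding(token_ids)))
        weights = torch.softmax(self.att(torch.tanh(h)), dim=1)
        pooled = (weights * h).sum(dim=1)          # attention-weighted sum
        return self.out(pooled)                    # logits; softmax in the loss
```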
RoBERTa Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018) is a pre-trained language model based on the Transformer architecture (Vaswani et al., 2017). It uses multiple multi-head attention layers to learn context information from both the left and the right side of tokens, and is trained with a masked language modeling objective: some input tokens are randomly masked and predicted from their context alone. RoBERTa (Liu et al., 2019) is an extension of BERT trained on more data with different hyperparameters and has achieved better performance on social media analysis tasks (Maronikolakis et al., 2020). We fine-tune RoBERTa on complaint severity classification by adding an output dense layer with a softmax activation function.
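For illustration, a fine-tuning sketch with the Hugging Face transformers library (the checkpoint name and example tweet are assumptions; the training loop is omitted):

```python
# Sketch: RoBERTa with a 4-way classification head for severity prediction.
from transformers import RobertaTokenizer, RobertaForSequenceClassification

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=4)   # dense output layer on top of the encoder

batch = tokenizer(["<USER> my order never arrived, please help"],
                  return_tensors="pt", padding=True, truncation=True)
logits = model(**batch).logits      # (1, 4) scores over severity classes
```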

M-RoBERTa with Linguistic Information
Multimodal-BERT (M-BERT) (Rahman et al., 2019) injects multimodal information such as image and speech into the text representations of BERT. It combines word embeddings with embeddings from other modalities (e.g. image, audio), which are then fed to a BERT encoder. M-BERT was recently adapted by Jin and Aletras (2020) for binary complaint prediction by injecting linguistic information instead of speech and image; however, it did not perform better than BERT in their setting. We adapt M-BERT by replacing (1) the underlying BERT model with RoBERTa; and (2) the multimodal information with linguistic information. We first use a fully connected layer to project the linguistic representations into vectors of comparable size to RoBERTa's embeddings. Then we concatenate the word representations from RoBERTa with the linguistic representations using a Multimodal Shifting Gate, where an attention gating mechanism is applied to control the influence of each representation. Finally, we apply layer normalization and dropout after the Multimodal Shifting Gate and pass the output to RoBERTa. We add an output layer to M-RoBERTa for classification, similar to the RoBERTa model. We use M-RoBERTa with three types of linguistic features (i.e. emotion, topic and their combination):

M-RoBERTa Emo
We first use emotional information obtained from the pre-trained emotion classifier of Volkova and Bachrach (2016). This is a 9-dimensional vector representing scores for sentiment (positive, negative and neutral) and the six basic emotions of Ekman (1992) (anger, disgust, fear, joy, sadness and surprise).
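A heavily simplified PyTorch sketch of the gated fusion described above; the exact shifting-gate formulation, dimensions and dropout rate are assumptions rather than the published architecture:

```python
# Sketch: project linguistic features, gate them against each word
# embedding, and shift the embedding before it enters RoBERTa.
import torch
import torch.nn as nn

class LinguisticShiftingGate(nn.Module):
    def __init__(self, emb_dim=768, ling_dim=9):
        super().__init__()
        self.project = nn.Linear(ling_dim, emb_dim)  # comparable size to RoBERTa
        self.gate = nn.Linear(2 * emb_dim, emb_dim)  # attention gating mechanism
        self.norm = nn.LayerNorm(emb_dim)
        self.dropout = nn.Dropout(0.1)

    def forward(self, word_emb, ling_feats):
        # word_emb: (batch, seq_len, emb_dim); ling_feats: (batch, ling_dim)
        ling = self.project(ling_feats).unsqueeze(1).expand_as(word_emb)
        g = torch.sigmoid(self.gate(torch.cat([word_emb, ling], dim=-1)))
        shifted = word_emb + g * ling                # gated shift of embeddings
        return self.dropout(self.norm(shifted))      # then fed into RoBERTa
```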

M-RoBERTa Top
We also use topical information: a 200-dimensional vector representing the fraction of tokens in each tweet that belong to each topic cluster (Preoţiuc-Pietro et al., 2015; Aletras and Chamberlain, 2018).
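An illustrative sketch of computing such a feature vector; the word-to-cluster mapping below is a hypothetical stand-in for the published clusters:

```python
# Sketch: fraction of a tweet's tokens falling in each of 200 word clusters.
import numpy as np

def topic_vector(tokens, word2cluster, n_clusters=200):
    vec = np.zeros(n_clusters)
    for tok in tokens:
        if tok in word2cluster:
            vec[word2cluster[tok]] += 1
    return vec / max(len(tokens), 1)   # fraction of tokens per cluster

word2cluster = {"refund": 17, "order": 17, "love": 83}   # toy mapping
print(topic_vector(["my", "order", "refund"], word2cluster).sum())
```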

M-RoBERTa Emo+Top
We finally experiment with injecting both emotional and topical information into M-RoBERTa.

Training and Evaluation
We run all models using nested 10-fold cross-validation, consisting of two nested loops as in Preotiuc-Pietro et al. (2019). In the outer loop, nine folds are used for training and one for testing; in the inner loop, 3-fold cross-validation is applied to the data from the nine training folds, with two folds used for training and one for validation. During training, we select the model with the smallest validation loss over 30 epochs. We measure predictive performance using the mean accuracy, precision, recall and macro F1 over the 10 folds (we also report standard deviations).
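A sketch of this nested cross-validation protocol with scikit-learn (data and model selection are placeholders):

```python
# Sketch: 10 outer folds for train/test, 3 inner folds for validation.
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(100)                       # placeholder data indices
outer = KFold(n_splits=10, shuffle=True)
inner = KFold(n_splits=3, shuffle=True)

for train_idx, test_idx in outer.split(X):
    for inner_train_idx, val_idx in inner.split(train_idx):
        # Train on train_idx[inner_train_idx]; keep the model/epoch with
        # the smallest loss on train_idx[val_idx] (over 30 epochs).
        pass
    # Evaluate the selected model on test_idx; average metrics over 10 folds.
```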

Results
Table 3 shows the performance of all models, including the baselines and M-RoBERTa combined with linguistic information, on complaint severity level prediction.
Overall, the M-RoBERTa models with linguistic features achieve the best results. M-RoBERTa Emo outperforms all other models, reaching up to 55.7 macro F1. This confirms our hypothesis that injecting extra emotion information improves complaint severity level prediction, and is in line with Trosborg (2011), who states that the expression of complaints is related to different emotional states. The results of M-RoBERTa Top and M-RoBERTa Emo+Top are comparable, with 55.2 and 55.5 macro F1 respectively. RoBERTa performs competitively but worse than the M-RoBERTa models. We also note that BiGRU-Att does not perform well on our task (43.5 macro F1), which may be due to its lack of pre-training.
Figure 1 presents the confusion matrix of our best model (M-RoBERTa Emo), normalized over the actual values (rows). The 'No Explicit Reproach' category has the highest percentage (77.2%) of correctly classified data points, followed by 'Disapproval' with 59.0%; these are also the two most frequent classes in the data set. On the other hand, results on 'Accusation' are the lowest (32.9%), as it is confused with its adjacent categories ('Disapproval' and 'Blame'). Furthermore, the margins between mis-classifications and correct classifications are relatively large for 'Blame'. We speculate that this is due to the distinctive linguistic characteristic of the 'Blame' category, which emphasizes someone's responsibility. Finally, a category is in general more likely to be mis-classified as one of its adjacent severity categories. For example, when predicting 'Disapproval', the model mis-classifies more instances as 'No Explicit Reproach' and 'Accusation' than as 'Blame'. This suggests that tweets belonging to neighboring severity levels share more semantic, syntactic and stylistic similarities.
We also compare the performance of our best model (M-RoBERTa Emo) with human agreement for each class (Figure 2). In general, the model's results (Figure 1) correlate with human agreement; in other words, the model and humans confuse the same categories. For instance, both easily confuse 'Accusation' with 'Disapproval' (32.9% vs. 31.1% for the model and 43.6% vs. 31.% for humans). However, annotators are better at distinguishing high-severity complaints from 'No Explicit Reproach': 21.2% of 'Disapproval' and 12.4% of 'Accusation' tweets are wrongly classified as 'No Explicit Reproach' by the model, compared to 18.5% and 8.9% respectively by humans. We argue that this is because annotators are able to identify subtle language (discussed further in Section 7). Also, we notice that the model achieves better performance than humans when predicting 'Blame', indicating a better capability of capturing the main characteristics of this class.
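For reference, a row-normalized confusion matrix as in Figure 1 can be computed with scikit-learn (the toy gold and predicted labels below are hypothetical):

```python
# Sketch: confusion matrix normalized over the actual values (rows).
from sklearn.metrics import confusion_matrix

labels = ["No Explicit Reproach", "Disapproval", "Accusation", "Blame"]
y_true = ["No Explicit Reproach", "Disapproval", "Accusation", "Blame"]
y_pred = ["No Explicit Reproach", "Disapproval", "Disapproval", "Blame"]

# normalize="true" divides each row by the count of that actual class.
cm = confusion_matrix(y_true, y_pred, labels=labels, normalize="true")
print(cm)
```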

Error Analysis
We perform an error analysis to shed light on the limitations of our best performing model (M-RoBERTa Emo ) on complaint severity level classification.
Firstly, we observe that most errors occur when the boundaries of 'Accusation' blur with 'Disapproval' and 'Blame'. The following two tweets are typical examples of 'Accusation' being mis-classified as 'Disapproval' and 'Blame' respectively:

"<USER>, thank you! Clear guidelines here, but not at all what your advisor on the phone stated!"

"The new <USER> stinks ...10mins to take my order and another 15 to get it. And stop asking my name like we're friends <URL>"

This is because some tweets belonging to 'Accusation' also contain negation (e.g. 'not at all') or negative terms (e.g. 'disappointed'), which appear frequently in 'Disapproval'. Also, consistent with the definition by Trosborg (2011) (directly or indirectly accusing someone of causing the problem), tweets belonging to 'Accusation' may describe an action and contain terms like '<USER>' or 'you', similar to complaints labeled as 'Blame', such as:

"Thanks <USER> for selling expired beer #fail <USER> <URL>"

Secondly, the model struggles with complaints expressed in more subtle ways. In the following two examples, tweets belonging to 'Disapproval' and 'Accusation' respectively are mis-classified as 'No Explicit Reproach':

"Think someone at <USER> had been drinking the stuff before they put the label on"

"Just opened a fresh bud light that was filled with water. Please explain <USER>."
Such complaints do not contain terms that are typical of any specific complaint severity category (e.g. negation and negative terms in 'Disapproval', personal pronouns and terms describing undesirable results in 'Blame'); predicting them correctly thus requires deeper contextual understanding.
Finally, compared to other category pairs, the model is more likely to confuse tweets belonging to 'No Explicit Reproach' and 'Disapproval'. This happens because some tweets express weak dissatisfaction, which is difficult to identify. The following 'Disapproval' tweet is mis-classified as 'No Explicit Reproach':

"Dearest <USER>: there really needs to be an easier method to report names that are inappropriate <URL>"

The model may need to learn more contextual information about such tweets instead of relying on certain indicative terms. Also, these two labels share similar terms such as 'dm', 'please help', 'can't work' and an interrogative tone. Examples of 'No Explicit Reproach' and 'Disapproval' respectively are the following (note the shared terms):

"Hey guys, I love this product featured on <USER> today but don't see a price? Help a girl out? <URL>"

"So it's going to cost $7000 to fix the exhaust on my <USER> 2009 jetta, and only $300 is covered under warranty. Help <USER>?"

Multi-task Learning for Binary Complaint Prediction
We further experiment with multi-task learning (MTL) (Caruana, 1997), using severity categories to improve binary complaint prediction (i.e. complaint or non-complaint). MTL enables two or more tasks to be learned jointly by sharing information and parameters of a model. We explore whether the severity level of a complaint helps complaint identification. We use the same data set as Preotiuc-Pietro et al. (2019), where each tweet is annotated as a complaint or not, together with our severity level annotations.

Predictive models
We first adapt three multi-task learning models based on bidirectional recurrent neural networks, recently proposed by Rajamanickam et al. (2020) for jointly modeling abusive language detection and emotion detection. We also adapt our M-RoBERTa Emo model to the multi-task setting using two variants. We use complaint severity prediction as the auxiliary task and binary complaint prediction as the main task to train the MTL models. All models are trained on the two tasks simultaneously with a joint loss:

$L = L_{com} + \alpha L_{sev}$

where $L_{com}$ and $L_{sev}$ are the losses of the complaint identification and severity level classification tasks respectively, and $\alpha$ is a parameter controlling the importance of each loss.
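A sketch of this objective in PyTorch, assuming the weighted-sum form above (a sigmoid output for the binary task and a softmax output for severity, as described below; α = 0.1 matches the value selected in the hyperparameter search):

```python
# Sketch: joint loss combining complaint identification and severity.
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()   # binary complaint task (sigmoid output)
ce = nn.CrossEntropyLoss()     # 4-way severity task (softmax output)
alpha = 0.1                    # weight of the auxiliary severity loss

def joint_loss(com_logits, com_gold, sev_logits, sev_gold):
    # com_gold: float 0/1 tensor shaped like com_logits;
    # sev_gold: long tensor of class indices in {0, 1, 2, 3}.
    return bce(com_logits, com_gold) + alpha * ce(sev_logits, sev_gold)
```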

MTL-Hard Sharing
We adapt the MTL-Hard Sharing model of Rajamanickam et al. (2020), in which a single encoder is shared and updated by both tasks. We first pass GloVe embedding representations to a shared stacked BiGRU encoder. The output of the shared encoder is then fed to two task-specific BiGRU-Att models, one for complaint detection and one for severity level identification. Finally, we add output layers with a sigmoid and a softmax activation function for the binary and multi-class predictions respectively.
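A condensed PyTorch sketch of this hard-sharing architecture; layer sizes follow the hyperparameters paragraph below, but the head design is our simplification:

```python
# Sketch: shared stacked BiGRU encoder + two task-specific BiGRU-Att heads.
import torch
import torch.nn as nn

class BiGRUAttHead(nn.Module):
    """BiGRU with self-attention pooling, as in the BiGRU-Att baseline."""
    def __init__(self, in_dim, hidden, n_out):
        super().__init__()
        self.bigru = nn.GRU(in_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.att = nn.Linear(2 * hidden, 1)
        self.out = nn.Linear(2 * hidden, n_out)

    def forward(self, x):
        h, _ = self.bigru(x)
        w = torch.softmax(self.att(torch.tanh(h)), dim=1)
        return self.out((w * h).sum(dim=1))

class MTLHardSharing(nn.Module):
    def __init__(self, vocab_size, emb_dim=200, hidden=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.shared = nn.GRU(emb_dim, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.com_head = BiGRUAttHead(2 * hidden, hidden, n_out=1)  # complaint
        self.sev_head = BiGRUAttHead(2 * hidden, hidden, n_out=4)  # severity

    def forward(self, token_ids):
        shared_out, _ = self.shared(self.embedding(token_ids))
        return self.com_head(shared_out), self.sev_head(shared_out)
```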

MTL-Double Encoder
Instead of sharing a single encoder, the MTL-Double Encoder model (Rajamanickam et al., 2020) uses two stacked BiGRU encoders, one task-specific (complaint detection only) and one shared by both tasks. We pass the output of the shared encoder to a BiGRU-Att model for severity level prediction. We also concatenate the outputs of the task-specific and shared encoders and pass them to another BiGRU-Att model for complaint prediction. The rest of the architecture is the same as in the MTL-Hard Sharing model.

MTL-Gated Double Encoder
The MTL-Gated Double Encoder model (Rajamanickam et al., 2020) has the same architecture as the MTL-Double Encoder, except that the outputs of the two stacked BiGRU encoders are combined by assigning a weight to each representation ((1 − β) for the output of the task-specific encoder and β for the output of the shared one), controlling the importance of the two representations.
Hyperparameters We train the MTL-BiGRU-Att and MTL-BiGRU-Att-DE models with the same hyperparameters as BiGRU-Att in complaint severity prediction, and we use the same training and evaluation protocol as in the complaint severity prediction task for all MTL models. For the MTL-Hard Sharing, MTL-Double Encoder and MTL-Gated Double Encoder models, the hidden size of the stacked BiGRU encoder(s) and the BiGRU-Att models is h = 128, h ∈ {64, 128, 256, 512}. We set β in the MTL-Gated Double Encoder and the remaining parameters in the three models to the same values as Rajamanickam et al. (2020). We train MTL-M-RoBERTa Emo and MTL-M-RoBERTa Emo-DE with a learning rate l = 1e-6, l ∈ {1e-5, 5e-6, 1e-6}; the rest of the parameters are the same as for M-RoBERTa Emo in complaint severity prediction. The parameter α, which controls the importance of the two losses, is set to .1, α ∈ {.001, .01, .1, .3, .5}.

Results
Table 4 shows results of the single-task learning (STL) and multi-task learning (MTL) models on the complaint identification task. Overall, all MTL models using M-RoBERTa Emo perform better than the majority of STL models, indicating that severity detection improves binary complaint identification. MTL-M-RoBERTa Emo outperforms all other models, achieving 88.2 macro F1, followed by MTL-M-RoBERTa Emo-DE with 88.1 F1. This confirms our hypothesis that complaint identification benefits from complaint severity level information when the two tasks are learned jointly. Also, MTL-BiGRU-Att performs better than BiGRU-Att in STL, achieving 75.4 F1, while the results of BiGRU-Att (74.5 F1) and MTL-BiGRU-Att-DE (74.1 F1) are comparable. We note that the models proposed by Rajamanickam et al. (2020) (i.e. MTL-Hard Sharing, MTL-Double Encoder and MTL-Gated Double Encoder) achieve low performance, with only MTL-Hard Sharing performing slightly better than the others (72.1 macro F1). We speculate that adding one or more extra BiGRU encoders before the BiGRU-Att model is an overly complex architecture for our data set.

Analysis
We investigate the influence of recognizing complaint severity levels on binary complaint identification in our MTL setting. We analyze predictions from the previous best performing model, BERT (STL), and MTL-M-RoBERTa Emo in a random fold (out of the 10 CV folds). We observe that 9.8% of predictions flip, where the number of complaints flipping to non-complaints is noticeably larger (88.2% of flips) than that of non-complaints flipping to complaints (11.8%). Similarly, we compare the predictions of BiGRU-Att (STL) and MTL-BiGRU-Att in the same fold. The flipping percentage (6.9%) is lower than for BERT and MTL-M-RoBERTa Emo, while the proportions of one class flipping to another are consistent (83.4% and 16.6% respectively). This indicates that complaint severity information encapsulates complementary information that helps the model predict non-complaints accurately. Table 5 shows flipping examples from BERT (STL) and MTL-M-RoBERTa Emo. From the first two rows, we see that the MTL model is not misled by negation (e.g. 'never') and negative terms (e.g. 'bad', 'very low'), using the extra knowledge provided by the severity level prediction task. Also, in the last two examples, the complaints are expressed in a more subtle way and rarely contain typical complaint-related terms. The MTL model is able to detect this type of complaint correctly because the severity level information encourages it to learn to distinguish such stylistic idiosyncrasies.
We further observe that 11.2% of wrong predictions remain the same across the two models, of which complaints and non-complaints account for 59.0% and 41.0% respectively; this means that severity features mostly help posts that are complaints to be classified accurately. On the other hand, the model still has difficulty predicting some non-complaint posts, which might be due to the lower performance of severity detection when used as an auxiliary task in the MTL setting (severity prediction is less accurate in MTL than in a single-task setting).

Conclusion
We presented the first study of the severity level of complaints in computational linguistics. We developed a publicly available data set of tweets labeled with four categories based on the linguistic theory of pragmatics. We modeled complaint severity level prediction as a new multi-class classification task and conducted experiments using different transformer-based networks combined with linguistic features, reaching up to 55.7 macro F1. We further used a multi-task learning setting to jointly model binary complaint prediction and complaint severity classification as an auxiliary task, achieving new state-of-the-art performance on complaint detection (88.2 macro F1). In the future, we plan to apply our methods in a multilingual setting across different platforms.