Complaint Identification in Social Media with Transformer Networks

Complaining is a speech act extensively used by humans to communicate a negative inconsistency between reality and expectations. Previous work on automatically identifying complaints in social media has focused on feature-based and task-specific neural network models. Adapting state-of-the-art pre-trained neural language models and combining them with other linguistic information from topics or sentiment for complaint prediction has yet to be explored. In this paper, we evaluate a battery of neural models underpinned by transformer networks, which we subsequently combine with linguistic information. Experiments on a publicly available data set of complaints demonstrate that our models outperform previous state-of-the-art methods by a large margin, achieving a macro F1 of up to 87.


Introduction
Complaining is a basic speech act, usually triggered by a discrepancy between reality and expectations towards an entity or event (Olshtain and Weinbach, 1985; Cohen and Olshtain, 1993; Kowalski, 1996). Social media has become a popular platform for expressing complaints online (Preotiuc-Pietro et al., 2019), where customers can directly address companies regarding issues with services and products. Complaint detection aims to identify a breach of expectations in a given text snippet. However, the use of implicit and ironic expressions and the accompaniment of other speech acts such as suggestions, criticism, warnings and threats (Pawar et al., 2015) make it a challenging task. Identifying and classifying complaints automatically is important for: (a) improving customer service chatbots (Coussement and Van den Poel, 2008; Lailiyah et al., 2017; Yang et al., 2019a); (b) linguists, to analyze complaint characteristics at a large scale (Vásquez, 2011; Kakolaki and Shahrokhi, 2016); and (c) psychologists, to understand the behavior of humans who express complaints (Sparks and Browning, 2010).
Previous work has focused on binary classification between complaints and non-complaints in various domains (Preotiuc-Pietro et al., 2019;Jin et al., 2013;Coussement and Van den Poel, 2008). Furthermore, some studies have performed more fine-grained complaint classification. For instance, complaints directed to public authorities have been categorized based on their topics (Forster and Entrup, 2017;Merson and Mary, 2017) or the responsible departments (Laksana and Purwarianti, 2014;Gunawan et al., 2018;Tjandra et al., 2015). Other categorizations are based on possible hazards and risks (Bhat and Culotta, 2017) as well as escalation likelihood (Yang et al., 2019a). Most of these previous studies have used supervised machine learning models with features extracted from text (e.g. bag-of-words, topics, features extracted from psycho-linguistic dictionaries) or task-specific neural models trained from scratch. Adapting state-of-the-art pre-trained neural language models based on transformer networks (Vaswani et al., 2017) such as BERT (Devlin et al., 2018) and XLNet (Yang et al., 2019b) has yet to be explored.
In this paper, we focus on the binary classification of Twitter posts as complaints or not, as defined by Preotiuc-Pietro et al. (2019). We adapt and evaluate a battery of pre-trained transformers which we subsequently combine with external linguistic information from topics and emotions.
Contributions (1) New state-of-the-art results on complaint identification in Twitter, improving macro F1 by 8.0% over previous work by Preotiuc-Pietro et al. (2019); (2) A qualitative analysis of the limitations of transformers in accurately predicting whether a given text is a complaint or not.

Complaint Prediction Task and Data
Given a text snippet (i.e. a tweet), we aim to classify it as a complaint or not. For that purpose, we use the data set by Preotiuc-Pietro et al. (2019), which contains tweets written in English that were manually annotated as complaints or not. It includes 1,232 complaints (62.4%) and 739 non-complaints (37.6%) across 9 domains (e.g. food, technology). Data statistics are shown in Table 1. We opted to use this data set because (1) it is publicly available; and (2) it allows a direct comparison with existing methods. We also use the data for distant supervision collected by Preotiuc-Pietro et al. (2019). This extra 'noisy' data source contains 18,218 complaint tweets collected by querying the Twitter API with complaint-related hashtags (e.g. #badbusiness, #badcustomerservice) and the same number of non-complaint tweets sampled randomly.

Transformer-based Models
Transformer architectures trained on language modeling have been recently adapted to downstream tasks demonstrating state-of-the-art performance (Weller and Seppi, 2019;Gupta and Durrett, 2019;Maronikolakis et al., 2020). In this paper, we adapt and subsequently combine transformers with external linguistic information for complaint prediction.
BERT, ALBERT and RoBERTa Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018) learns language representations by jointly conditioning on both left and right contexts using transformers. It is trained with masked language modeling, where some tokens are randomly masked with the aim of predicting them using only their context.
We further experiment with ALBERT (Lan et al., 2019) and RoBERTa (Liu et al., 2019). ALBERT uses two parameter-reduction methods to address the memory limitations and long training time of BERT: (a) factorized embedding parameterization; and (b) cross-layer parameter sharing. RoBERTa is an extension of BERT trained on more data with a larger batch size using dynamic masking (i.e. the masked tokens of each sequence change across training epochs). We adapt BERT, ALBERT and RoBERTa by adding a linear layer with a sigmoid activation and then fine-tune them on the complaint classification data.
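As a minimal sketch of this adaptation (not the paper's exact code), the snippet below adds a linear layer with a sigmoid on top of an encoder's pooled representation; `ToyEncoder` is a hypothetical stand-in for a pre-trained BERT/ALBERT/RoBERTa checkpoint.

```python
import torch
import torch.nn as nn

class ComplaintClassifier(nn.Module):
    """Sketch: a pre-trained encoder's pooled representation is passed
    through a linear layer with a sigmoid to produce a complaint
    probability. `encoder` stands in for BERT/ALBERT/RoBERTa."""
    def __init__(self, encoder, hidden_size=768):
        super().__init__()
        self.encoder = encoder                  # pre-trained transformer
        self.classifier = nn.Linear(hidden_size, 1)

    def forward(self, input_ids):
        cls_repr = self.encoder(input_ids)      # (batch, hidden_size)
        logit = self.classifier(cls_repr)       # (batch, 1)
        return torch.sigmoid(logit).squeeze(-1)  # complaint probability

# Toy stand-in encoder: token embeddings averaged over the sequence.
class ToyEncoder(nn.Module):
    def __init__(self, vocab=100, hidden=768):
        super().__init__()
        self.emb = nn.Embedding(vocab, hidden)
    def forward(self, ids):
        return self.emb(ids).mean(dim=1)

model = ComplaintClassifier(ToyEncoder())
probs = model(torch.randint(0, 100, (4, 12)))   # 4 tweets, 12 tokens each
print(probs.shape)  # torch.Size([4])
```

In practice the encoder would be a Hugging Face checkpoint and the whole stack would be fine-tuned end-to-end with a binary cross-entropy loss.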
XLNet XLNet (Yang et al., 2019b) uses a similar architecture to BERT to learn bidirectional contextual information. Instead of the masked tokens used in BERT, XLNet maximizes the expected log-likelihood of a sequence over all possible factorization orders. We adapt and fine-tune the XLNet model for complaint prediction in the same way as BERT.
M-BERT To combine our model with external linguistic information, we adapt the Multimodal BERT (M-BERT) (Rahman et al., 2019) model structure that was introduced for multimodal modeling (text, image, speech). Instead of cross-modal interactions, we inject extra linguistic information as alternative views of the data into the pre-trained BERT model. We use (a) Emotion, a 9-dimensional vector obtained by quantifying the six basic emotions of Ekman (1992) for each tweet using a predictive model by Volkova and Bachrach (2016); (b) Topics, a 200-dimensional vector representing word frequencies in word clusters designed to identify semantic themes in tweets by Preotiuc-Pietro et al. (2015). To inject external linguistic information into M-BERT, we first project the linguistic information into vectors of the same size as the BERT CLS embeddings. Then we concatenate word representations obtained from BERT and the linguistic information (Emotion, Topics or Emotion+Topics) to generate combined embeddings. During concatenation, an attention gating mechanism called the Multimodal Shifting Gate (Wang et al., 2019) is applied to control the importance of each representation. Finally, the combined embeddings are fed to BERT for fine-tuning. The rest of the architecture is the same as BERT.
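A minimal sketch of how such a gated injection could look, assuming a 200-dimensional topic vector and 768-dimensional BERT embeddings (names and dimensions are illustrative, not the authors' implementation):

```python
import torch
import torch.nn as nn

class LinguisticGate(nn.Module):
    """Sketch of gated fusion in the spirit of a shifting gate: project
    an external linguistic vector to the encoder's hidden size, compute
    a sigmoid gate from both views, and shift the word embedding by the
    gated linguistic signal."""
    def __init__(self, ling_dim=200, hidden=768):
        super().__init__()
        self.project = nn.Linear(ling_dim, hidden)   # to BERT size
        self.gate = nn.Linear(2 * hidden, hidden)    # attention gate

    def forward(self, word_emb, ling_vec):
        ling = self.project(ling_vec)                # (batch, hidden)
        g = torch.sigmoid(self.gate(torch.cat([word_emb, ling], dim=-1)))
        return word_emb + g * ling                   # shifted embedding

fuse = LinguisticGate()
shifted = fuse(torch.randn(4, 768), torch.randn(4, 200))
print(shifted.shape)  # torch.Size([4, 768])
```

The gate lets the model downweight the linguistic view per example, so uninformative topic or emotion vectors do not overwhelm the contextual representation.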

Experiments
Baselines We compare the transformer-based models with two previous approaches for complaint identification by Preotiuc-Pietro et al. (2019) and a transfer learning method: (1) Logistic Regression with bag-of-words trained using the original and distantly supervised complaint data (LR-BOW + Dist. Supervision); (2) a Long Short-Term Memory (LSTM) network (Hochreiter and Schmidhuber, 1997) that takes a tweet as input, maps its words to embeddings and subsequently passes them through the LSTM to obtain a contextualized representation, which is finally fed to the output layer; (3) adapting the pre-trained Universal Language Model Fine-tuning (ULMFiT) model (Howard and Ruder, 2018) for complaint prediction. ULMFiT uses an AWD-LSTM (Merity et al., 2017) encoder for language modeling.
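Baseline (2) can be sketched as follows (vocabulary size and dimensions are illustrative placeholders, not the paper's hyper-parameters):

```python
import torch
import torch.nn as nn

class LSTMBaseline(nn.Module):
    """Sketch of the LSTM baseline: embed tweet tokens, run an LSTM,
    and feed the final hidden state to a sigmoid output layer."""
    def __init__(self, vocab=5000, emb=200, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, ids):
        _, (h, _) = self.lstm(self.emb(ids))   # h: (1, batch, hidden)
        return torch.sigmoid(self.out(h[-1])).squeeze(-1)

baseline = LSTMBaseline()
p = baseline(torch.randint(0, 5000, (4, 20)))  # 4 tweets, 20 tokens each
print(p.shape)  # torch.Size([4])
```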
Evaluation Following Preotiuc-Pietro et al. (2019), we use a nested 10-fold cross-validation approach to conduct our experiments for complaint prediction. In the outer 10 loops, 9 folds are used for training and one for testing; in the inner loops, 3-fold cross-validation is applied, where 2 folds are used for training and one for validation. During training, early stopping is applied based on the validation loss. We measure predictive performance using the mean Accuracy, Precision, Recall and macro F1 over the 10 folds (we also report standard deviations).
Table 2 shows the results of the transformer-based models as well as the baselines on the complaint prediction task. All transformer-based models (BERT, ALBERT, RoBERTa and XLNet) perform better than the previous feature-based baseline (LR-BOW + Dist. Supervision) and the non-transformer transfer learning baseline (ULMFiT), indicating a better capability of capturing the idiosyncrasies of complaint syntax and semantics. BERT outperforms the other models across all metrics, reaching a macro F1 of up to 87, which is 8% higher than the previous state-of-the-art (Preotiuc-Pietro et al., 2019). The results of RoBERTa are close to BERT with 86.6 macro F1, while ALBERT and XLNet achieve lower performance.
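The nested cross-validation protocol can be sketched with scikit-learn's `KFold` (the 100-element array is a stand-in for the actual tweets):

```python
from sklearn.model_selection import KFold
import numpy as np

# 10 outer folds (9 train / 1 test); within each outer training set,
# 3 inner folds (2 train / 1 validation) used for early stopping.
X = np.arange(100)  # stand-in for 100 tweets
outer = KFold(n_splits=10, shuffle=True, random_state=0)
inner = KFold(n_splits=3, shuffle=True, random_state=0)

n_runs = 0
for train_idx, test_idx in outer.split(X):
    for tr_idx, val_idx in inner.split(train_idx):
        # train on X[train_idx][tr_idx], early-stop on the validation
        # split, evaluate the selected model on X[test_idx]
        n_runs += 1
print(n_runs)  # 10 outer folds x 3 inner folds = 30 training runs
```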

Predictive Results
Distant supervision is beneficial only to ULMFiT and M-BERT, while BERT and the other transformer models perform worse; this is consistent with the results of Bataa and Wu (2019) for sentiment analysis.
Error Analysis We also investigate the limitations of our best-performing model (BERT). We randomly sample 100 misclassified cases: 50 complaints misclassified as non-complaints and 50 non-complaints misclassified as complaints. Among complaints misclassified as non-complaints, 26% of errors are due to implicit expressions and 14% are due to irony. In the former case, complaints express weak emotional intensity without explicit reproach: complainers imply their dissatisfaction instead of directly complaining or mentioning the cause (Trosborg, 2011). The following tweet is a typical example: It started yesterday , but I try again it could work normal. But since last night its just like this <url> Such expressions rarely include words related to complaints (e.g. 'disappointed', 'bad service') and are therefore difficult to classify correctly. In the latter case, complaints are expressed ironically using terms such as 'congratulations', 'thank you' and 'brilliant'. For instance, the following text was wrongly classified as a non-complaint: Thank you so much for making a box that shreds apart even when carried by both handles.
In cases where non-complaints were misclassified as complaints, errors can be roughly divided into four categories: (1) 26% of errors arise because certain terms, such as 'thank you', 'dm', 'lost' and 'work', appear frequently in complaints during training. The following non-complaint was wrongly classified as a complaint: BTW <user> -<user> did me right, and replaced my two failed batteries under warranty. I'm happy :) thanks <user>!
It contains similar words to the following complaint in the same fold (similarities highlighted in bold): Was happy to find out <user> had an app to watch all their shows, until 6 episodes in it stops working. Thanks! <user> (2) 22% of errors are due to an interrogative tone, which is common in complaints. An example is "Folks , what is cost of text message to a us number?" (3) 22% of errors stem from negation words, as in "No luck with pc or phone." (4) 12% of errors occur because texts containing negative sentiment, such as "This would be a terrible idea <url>", are likely to be incorrectly classified as complaints, since words such as 'terrible' are widely used to express dissatisfaction even when there are not enough cues indicating a violation of expectations. Overall, the proportion of complaints misclassified as non-complaints (15.22%) is higher than that of non-complaints misclassified as complaints (10.25%), indicating that implicit and figurative expressions, as well as unknown factors, make complaints more challenging to identify.
Cross-Domain Experiments Finally, we use BERT to train models on one domain and test on another, as well as training on all domains except the one the model is tested on. Table 3 shows the performance of the models in Preotiuc-Pietro et al. (2019) (left) and BERT (right) across 9 domains. We first observe that, when training on a single domain, BERT scores lower than LR-BOW in nearly half of the cases (especially on 'Food', 'Car' and 'Other'), while BERT trained on all other domains performs better across all test domains, achieving a macro F1 of up to 88.2 when tested on 'Other'. This indicates that fine-tuning BERT on a small training set ('Food', 'Car' and 'Other' are among the domains with the smallest amounts of data) is not enough to make it perform well. In contrast, it consistently achieves better performance on larger data sets (All). We also notice that BERT performs robustly for domain pairs regardless of which domain is used for training and which for testing. For example, training on 'Apparel' achieves high performance when testing on 'Software' (81.8 F1) and vice versa (80.0 F1). Furthermore, domain relevance affects predictive performance. For example, BERT trained on 'Transport' achieves 79.2 F1 when tested on 'Car', the highest performance among training domains, since these two domains share common vocabulary (see the 'Car' column for BERT).

Conclusion
We evaluated a battery of transformer networks on the Twitter complaint identification task and obtained a macro F1 of 87, outperforming the previous state-of-the-art results of Preotiuc-Pietro et al. (2019). We further presented a thorough analysis of the limitations of our models in predicting complaints. In future work, we intend to further explore how to combine other sources of linguistic information with transformers, as well as information from other modalities such as images.