A Million Tweets Are Worth a Few Points: Tuning Transformers for Customer Service Tasks

In online domain-specific customer service applications, many companies struggle to deploy advanced NLP models successfully, due to the limited availability of and noise in their datasets. While prior research demonstrated the potential of migrating large open-domain pretrained models for domain-specific tasks, the appropriate (pre)training strategies have not yet been rigorously evaluated in such social media customer service settings, especially under multilingual conditions. We address this gap by collecting a multilingual social media corpus containing customer service conversations (865k tweets), comparing various pipelines of pretraining and finetuning approaches, applying them on 5 different end tasks. We show that pretraining a generic multilingual transformer model on our in-domain dataset, before finetuning on specific end tasks, consistently boosts performance, especially in non-English settings.


Introduction
Online platforms and social media are increasingly important as communication channels in various companies' customer relationship management (CRM). To ensure effective, qualitative and timely customer service, Natural Language Processing (NLP) can assist by providing insights to optimize customer interactions, but also in real-time tasks: (i) detect emotions (Gupta et al., 2010), (ii) categorize or prioritize customer tickets (Molino et al., 2018), (iii) aid in virtual assistants through natural language understanding and/or generation (Cui et al., 2017), etc.
Despite this NLP progress for CRM, often small and medium-sized companies (SMEs) struggle with applying such recent technology due to the limited size, noise and imbalance in their datasets. General solutions to such challenges are transfer learning strategies (Ruder, 2019): feature extraction uses frozen model parameters after pretraining on an external (larger) training corpus, while finetuning continues training on the smaller in-domain corpus. In the large body of work adopting such strategies (e.g., Pan and Yang 2009), little effort has been put into addressing specific CRM use cases that need to rely on social media data that is noisy, possibly multilingual, and domain-specific for a given company.
In this paper, we analyze the possibilities and limitations of transfer learning for a number of CRM tasks, following up on the findings of Gururangan et al. (2020), who demonstrate gains from progressive finetuning on in-domain and task-specific monolingual data. Specifically, our contributions are that we (1) collect a multilingual corpus of 275k Twitter conversations, comprising 865k tweets between customers and companies in 4 languages (EN, FR, DE, NL), covering distinct sectors (telecom, public transport, airline) (Section 4.1); (2) rigorously compare combinations of pretraining and finetuning strategies (Section 3) on 5 different CRM tasks (Section 4.2), including prediction of complaints, churn, subjectivity, relevance, and polarity; and (3) provide empirical results (Section 5). We find that additional pretraining on a moderately sized in-domain corpus, before task-specific finetuning, contributes to overcoming the lack of a large multilingual domain-specific language model. Its effect is much stronger than consecutive finetuning on smaller datasets for related end tasks. Furthermore, our experimental results show that when pretrained models are used in feature extraction mode, they struggle to beat well-tuned classical baselines.

Related Work
A wide range of NLP research has been devoted to customer services. Hui and Jha (2000) employed data mining techniques to extract features from a customer service database for decision support and machine fault diagnosis. Gupta (2011) extracted a set of sentiment and syntactic features from tweets for customer problem identification tasks. Molino et al. (2018) introduced the Customer Obsession Ticket Assistant for ticket resolution, using feature engineering techniques and encoder-decoder models. Highly popular pretrained language models, such as BERT (Devlin et al., 2019), have also been explored for different customer service tasks: Hardalov et al. (2019) considered re-ranking candidate answers in chatbots, while Deng et al. (2020) proposed BERT-based topic prediction for incoming customer requests.
Although the performance gains obtained by pretraining language models are well-established, they need further exploration in terms of multilinguality. Some studies (Pires et al., 2019; Karthikeyan et al., 2019; Wu et al., 2019) have investigated the transferability of multilingual models on different tasks, but they do not consider the effect of progressive pretraining on a smaller and less diverse multilingual corpus, as we will do.

Architecture
We selected some of the most popular publicly available pretrained language models to explore transfer learning properties for CRM classification tasks: RoBERTa (Liu et al., 2019), XLM (Conneau et al., 2020), and BERTweet (Nguyen et al., 2020). These models are pretrained on the English Wikipedia and BookCorpus (Zhu et al., 2015), CommonCrawl in 100 languages, and 850M English tweets, respectively. The XLM and BERTweet pretraining procedure is based on RoBERTa, which itself is a transformer-based Masked Language Model (MLM; Devlin et al., 2019). All of these models require a different classifier 'head' for each target task to estimate the probability of a class label.
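To make the MLM objective concrete, the following is a minimal sketch of BERT-style token corruption as described by Devlin et al. (2019): roughly 15% of positions are selected for prediction, and of those 80% are replaced by a mask token, 10% by a random token, and 10% are left unchanged. The function name, the toy vocabulary, and the whitespace tokenization are illustrative assumptions; the actual models operate on subword units and differ in implementation details.

```python
import random

MASK = "<mask>"  # placeholder mask symbol (assumption; each model defines its own)

def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (corrupted, labels); labels hold the original token at
    selected positions and None everywhere else."""
    rng = rng or random.Random()
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model has to recover this token
            r = rng.random()
            if r < 0.8:
                corrupted.append(MASK)               # 80%: mask symbol
            elif r < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: random token
            else:
                corrupted.append(tok)                # 10%: keep unchanged
        else:
            labels.append(None)  # position does not contribute to the loss
            corrupted.append(tok)
    return corrupted, labels
```

Continuing pretraining on a new corpus reuses exactly this objective on new text, which is why it needs no labels.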

Transfer Strategies
We adopt a straightforward approach to transfer learned representations: we continue pretraining the considered transformer models on a 4-lingual corpus of customer service Twitter conversations (see Section 4.1), i.e., the overall domain of all considered sub-tasks. After that, we apply additional adaptation for cross-lingual transfer (Section 5.1), as well as cross-task transfer (Section 5.2).
The following notation is used throughout the rest of this paper to describe the (pre)training stages:
• π - further pretraining the original MLM on our 4-lingual tweet conversation corpus.
• ϕ - finetuning the pretrained model, extended with the MLP classifier, on the target task.
• ∅ - freezing the pretrained model (i.e., feature extraction mode), only training the top classifier on the target task.
We thus indicate several multistage procedures: e.g., XLM→ π → ϕ indicates that the XLM model is further pretrained on the in-domain tweet corpus, followed by finetuning on the end task.
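The notation can be read as a small pipeline description. The sketch below parses such a string and records, per stage, which parameters are updated. The class and function names are hypothetical and serve only to illustrate the three stage definitions given above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    symbol: str
    data: str                # what the stage trains on
    trains_encoder: bool     # are the transformer weights updated?
    trains_classifier: bool  # is a task classifier head trained?

STAGES = {
    "π": Stage("π", "unlabelled in-domain tweet corpus", True, False),
    "ϕ": Stage("ϕ", "labelled task data", True, True),
    "∅": Stage("∅", "labelled task data", False, True),
}

def parse_pipeline(spec):
    """Split e.g. 'XLM→ π → ϕ' into the base model and its list of stages."""
    parts = [p.strip() for p in spec.split("→")]
    base, symbols = parts[0], parts[1:]
    return base, [STAGES[s] for s in symbols]

base, stages = parse_pipeline("XLM→ π → ϕ")
for stage in stages:
    print(base, stage.symbol, stage.data,
          "encoder updated" if stage.trains_encoder else "encoder frozen")
```

Under this reading, XLM→ ∅ and XLM→ ϕ differ only in whether the encoder weights are updated while the classifier is trained.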

Experimental Setup
We focus our experiments on text classification problems that are commonly dealt with by customer service teams. First, we describe our Twitter conversation corpus used for in-domain pretraining (Section 4.1), then we introduce the target tasks and corresponding datasets (Section 4.2). For most target tasks, we hold out 10% of the data for testing, while the remaining part is used for training. We then use 10-fold cross-validation on the training data to select optimal hyper-parameters for each end task. When the dataset comes with a predefined train-test split, we keep that split. For the pretrained transformer models (RoBERTa, XLM, BERTweet), we use the publicly available 'base' versions.
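The evaluation protocol for datasets without a predefined split can be sketched as follows. This is a minimal illustration under the assumption of a random, non-stratified split with a fixed seed; the text does not state whether shuffling or stratification was used, and the function names are hypothetical.

```python
import random

def holdout_split(n_items, test_fraction=0.1, seed=0):
    """Shuffle item indices and hold out a fraction for testing."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    n_test = round(n_items * test_fraction)
    return idx[n_test:], idx[:n_test]  # (train indices, test indices)

def k_fold(train_idx, k=10):
    """Yield (fit, validation) index lists; every training item is
    used for validation exactly once across the k folds."""
    folds = [train_idx[i::k] for i in range(k)]
    for i in range(k):
        fit = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield fit, folds[i]

# Example with the size of the churn dataset (4,339 tweets).
train_idx, test_idx = holdout_split(4339)
for fit_idx, val_idx in k_fold(train_idx):
    pass  # train with one hyper-parameter setting on fit_idx, score on val_idx
```

The held-out test indices are never touched during hyper-parameter selection; only the cross-validation folds are.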

Twitter Conversation Corpus
Our corpus for in-domain pretraining was crawled using Twitter's API. The collected dataset is small compared to the original language models' data, but still larger than most corpora which SMEs have at their disposal. As such, it represents an easily collectable customer service dataset that SMEs can leverage to boost models on their own data. The tweets were gathered between May and October 2020. We started by gathering a list of 104 companies, all active on Twitter, in the sectors of telecommunication, public transportation, and airlines. We aimed for four different languages (English, French, Dutch, German).
We preprocessed the data by removing conversations not covering at least one client/company interaction, or containing undefined languages. We further converted usernames and links into the special tokens @USER and @HTTP URL, respectively, and converted emojis into corresponding strings. 3 The resulting corpus contains 865k tweets over 275k conversations in the four target languages (see Table 1). Even though our corpus contains data from different sectors, we noticed that the dialogue flow, customer intents, and structure of conversations are fairly comparable across the target sectors (cf. Fig. 1). Examples of often recurring types of tweets are expressions of gratitude towards customers, requests for information, or typical ways to reply to complaints. Hence, we expect this corpus to be useful not only for companies that fall under one of the included sectors, but also for other companies that provide customer services over tweets.
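The normalization steps can be sketched with regular expressions. This is a minimal sketch, not the pipeline actually used: the regular expressions, the function names, and the (role, text) representation of a conversation are assumptions, the URL token is spelled exactly as it appears in the text above, and emoji conversion is omitted because it relies on a third-party package.

```python
import re

USER_TOKEN = "@USER"
URL_TOKEN = "@HTTP URL"  # copied from the text; the exact token string is an assumption

# A single pass avoids re-matching the inserted URL token as a username.
_PATTERN = re.compile(r"(?P<url>https?://\S+|www\.\S+)|(?P<user>@\w+)")

def normalize_tweet(text):
    """Replace links and usernames by their special tokens."""
    def repl(match):
        return URL_TOKEN if match.group("url") else USER_TOKEN
    return _PATTERN.sub(repl, text)

def has_interaction(turns):
    """Keep a conversation only if both a customer and a company take part.
    `turns` is a list of (role, text) pairs with role 'customer' or 'company'."""
    roles = {role for role, _ in turns}
    return {"customer", "company"} <= roles

print(normalize_tweet("@SomeAirline my flight is delayed, see https://t.co/abc123"))
```

Note that the simple `\S+` pattern also swallows punctuation directly attached to a link, which a production tokenizer would handle more carefully.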

CRM Tasks and Datasets
Complaint Prediction - Timely complaint detection is of utmost importance to organizations, as it can improve their relationship with customers and prevent customer churn. Preoţiuc-Pietro et al. (2019) and Greenleaf et al. (2015) proposed two datasets for identifying complaints on social media, which contain 3,499 and 5,143 instances, respectively. The former (Complaint-2) covers two types of companies (airline and telecommunication), while the latter (Complaint-9) consists of data from nine domains such as food, car, software, etc. Both datasets are in English. To experiment with cross-lingual tuning for complaint prediction, we use the French complaint dataset for railway companies (Complaint-R; Ruytenbeek et al., 2020). Since all of its 201 conversations are labeled as complaints, we complemented them for training with negative samples drawn from French railway conversations in our own Twitter corpus. For testing, we annotated 200 held-out conversations.
Churn Prediction - Customer churn implies that a customer stops using a company's service, negatively impacting its growth. Churn prediction is cast as a binary classification task (churn or non-churn) on any input text. We use the data provided by Amiri and Daume III (2015) with tweets from three telecommunication brands, resulting in a corpus of 4,339 labelled English tweets.
Subjectivity Prediction - Detecting subjectivity in conversations is a key task for companies to efficiently address negative customer feelings or reward loyal customers. It may also serve as a filtering task for more fine-grained tasks such as emotion identification. We annotated 8,174 Dutch conversations from our Twitter corpus (Section 4.1). A dialogue is judged "subjective" if at least one of the customer turns contains emotions (explicit or implicit), and "objective" otherwise.
Relevance Prediction - The goal of this task is to determine whether an incoming text is relevant for further processing or not. We use data from GermEval 2017 (Task A), which contains over 28k short messages from various social media and web sources on the German public train operator Deutsche Bahn (Wojatzki et al., 2017). For this dataset, performance is measured on two evaluation sets: one collected from the same time period as the training and development set (viz. synchronic), and another one containing data from a later time period (viz. diachronic).
Polarity Prediction - For this task, a system has to classify the sentiment that resides in a given text fragment according to polarity (positive, negative, or neutral). Polarity prediction has often been applied on reviews, by predicting the attitude or senti-
Table 2: Classification results (accuracy ACC and F1-score) on CRM tasks using pretrained language models with two settings for pretraining: Feature extraction (→ ∅) and finetuning (→ ϕ). Missing values ('-') are due to unavailable reference scores, or a language mismatch between model and task.
3 We used https://github.com/carpedm20/emoji to convert emojis.

Results and Discussion
We now present our findings for two finetuning scenarios: transfer across languages and across tasks. Section 5.1 investigates the effect of unsupervised multilingual pretraining. Section 5.2 then explores how to further improve by finetuning the pretrained language models on similar tasks.

Language Transferability
We compare the pretrained transformer experiments with the following baselines: majority-class (to get an idea of class imbalance), logistic regression (LR) and support vector machine (SVM) with tf-idf features. For the three transformer models, we compare the feature extraction setting (→ ∅) with finetuning (→ ϕ) on the target task. On the multilingual XLM, we measure the impact of first pretraining (→ π) on our multilingual tweet corpus, after which both transfer settings are again tested on the target tasks. Table 2 reports the results (in terms of accuracy and F1 scores), including scores from literature when available ('Reference'). It should be noted that the reference scores are not state-of-the-art, but they are the scores communicated in the original dataset papers.
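To illustrate why the majority-class baseline is informative about class imbalance, the following sketch computes accuracy and F1 for such a predictor on toy labels. The label values and function names are hypothetical, and F1 is computed for a single positive class; the averaging scheme behind the F1 scores in Table 2 is not specified here, so this is an assumption.

```python
from collections import Counter

def majority_baseline(train_labels, n_test):
    """Always predict the most frequent training label."""
    label, _ = Counter(train_labels).most_common(1)[0]
    return [label] * n_test

def accuracy(gold, pred):
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def f1(gold, pred, positive):
    """F1 of one class, from true/false positives and false negatives."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

train = ["non-churn"] * 8 + ["churn"] * 2   # toy, imbalanced labels
gold = ["non-churn", "non-churn", "non-churn", "churn"]
pred = majority_baseline(train, len(gold))
print(accuracy(gold, pred), f1(gold, pred, "churn"))
```

On imbalanced data the baseline reaches a high accuracy while its F1 on the minority class is zero, which is why both metrics are worth reporting side by side.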
Results for BERTweet and RoBERTa are reported only for the English tasks (Complaint-2 and Churn). The monolingual tweet-based model BERTweet outperforms all other models when finetuned on these tasks. Although a large domain-specific monolingual language model seems a fine choice, it may not be available for other languages. We therefore investigate the impact of a multilingual generic model (XLM was not specifically pretrained on tweets), and the impact of additional pretraining on our dedicated Twitter corpus.
In general, transformer models finetuned on the end task strongly outperform frozen ones. For the non-English tasks, the model XLM→ ∅ with the frozen XLM encoder shows weak performance, in some cases below the baselines. The model XLM→ ϕ finetuned on the end task performs better. For the non-English tasks, the XLM model pretrained on our Twitter corpus and finetuned on the tasks (XLM→ π→ ϕ) in all cases outperforms the finetuned XLM by a few percentage points and the baselines by an even larger margin. The performance differences between XLM→ ϕ and XLM→ π→ ϕ clearly underscore the importance of in-domain multilingual pretraining. Furthermore, the results of XLM→ π→ ϕ for the English tasks suggest that additional pretraining on a moderately small, in-domain dataset can make the performance of the multilingual XLM model comparable to the monolingual RoBERTa.
Another notable observation is that well-tuned classical baselines, such as SVM, are strong competitors compared to frozen language models, especially on tasks that are highly sensitive to domain-specific features. For instance, for churn prediction, keywords such as 'switch to', 'quit' and 'change provider' can easily be picked up by the SVM, while frozen pretrained models have not learned to identify these features. This finding might be helpful to achieve better insight into the operational aspects of frozen neural models compared to simple classical approaches.
As a side result (not explicitly included in this work) we found that the multistage pretraining (XLM→ π) leads to better performance when incorporating multiple languages compared to a single language. The performance drops especially when training data from a single language (e.g., Dutch) is fed into the model, which is then evaluated on other languages (e.g., English).

Task Transferability
We now investigate to what extent representations tuned on a related task can help for a given target task. In particular, Complaint-9 is the end task, and we compare the effect of finetuning on the end task only, vis-à-vis first finetuning on a related task and then on the end task. For the related task, we experiment with Complaint-2 and Sanders, as shown in Table 3. We observe that there seems to be no clear merit in the additional finetuning step on a small related end task. Pretraining on our larger Twitter corpus, however, still increases effectiveness.

Conclusion
We investigated multilingual and across-task transfer learning for customer support tasks, based on transformer-based language models. We confirmed prior insights that finetuning the models on low-resource end tasks is important. Additional pretraining on a moderately sized in-domain corpus, however, provides a complementary increase in effectiveness, especially in the non-English setting and starting from a generic multilingual language model. We provide a newly collected multilingual in-domain corpus for customer service tasks and derive the aforementioned findings from experiments using it on five different tasks.