Using BERT for Qualitative Content Analysis in Psychosocial Online Counseling

Qualitative content analysis is a systematic method commonly used in the social sciences to analyze textual data from interviews or online discussions. However, this method usually requires high expertise and manual effort because human coders need to read, interpret, and manually annotate text passages. This is especially true if the system of categories used for annotation is complex and semantically rich. Therefore, qualitative content analysis could benefit greatly from automated coding. In this work, we investigate the use of machine learning-based text classification models for automatic coding in the area of psycho-social online counseling. We developed a system of over 50 categories to analyze counseling conversations, manually labeled over 10,000 text passages, and evaluated the performance of different machine learning-based classifiers against human coders.


Psycho-Social Online Counseling
Online counseling has developed into a full-fledged psycho-social counseling service in Germany since the 1990s. Today, people can get advice on a wide variety of psycho-social topics in web forums and dedicated text-based counseling platforms. Online counseling is provided by psycho-social professionals who have received special training in this method. Similar to face-to-face psycho-social counseling, some aspects are known to make up high-quality online counseling, but there is little empirical evidence for specific impact factors (Fukkink et al. 2009, Dowling & Rickwood 2014).
Due to the complexity of the content, quantitative approaches have not been able to analyze the meaning and significance of methodical patterns in large numbers of counseling communications (Navarro et al. 2019). It is, however, possible to understand and describe the meaning of online counseling content with qualitative approaches (Bambling et al. 2008, Gatti et al. 2016). This allows linking certain interventions of the counselors to the reactions of the clients on a case-by-case basis. But generalized statements on causal relationships are not possible with the small number of cases from qualitative studies (Ersahin & Hanley 2017).
An analysis of large numbers of counseling conversations using qualitative social research tools would help to better understand how successful online counseling works. Few related studies on these topics are available. Althoff et al. (2016) defined different models to measure general conversation strategies like adaptability, dealing with ambiguity, creativity, making progress or change in perspective and illustrated their applicability on a corpus of data from SMS counseling. Pérez-Rosas et al. (2019) analyzed the quality of consulting communications based on video recordings. Their automatic classifier used linguistic aspects of the content and could predict counseling quality with relatively good accuracy. However, neither of the mentioned approaches had the intention to recognize the meaning of individual phrases even though this deep understanding is crucial to eliminate weaknesses in the education of online counselors (Luitgaarden et al. 2016, Niuewboer et al. 2014). In addition, systems could be developed to provide online advisors with practical suggestions for improving their work.

Qualitative Content Analysis
Qualitative social research is a generic term for various research approaches. It attempts to gain a better understanding of people's social realities and to draw attention to recurring processes, patterns of interpretation, and structural characteristics (Kergel, 2018).
One such research approach deals with the content analysis of texts, the so-called qualitative content analysis according to Mayring (2015). It is a central source of scientific knowledge in qualitative social research. It tries to determine the subjective meaning of contents in texts. For this purpose, categories are formed based on known scientific theories on the topic and the discursive examination of the content. The definitions of those categories along with representative text passages are summarized in a codebook.
Then, human coders are coached in using the codebook. The coaching process and the implementation of the coding require high human expertise and manual effort because the coders must read, interpret, and annotate each text passage. Thus, qualitative studies can only be applied to a limited number of texts. Furthermore, it is hardly possible to define the categories so precisely that all coders find identical results, as human language is inherently ambiguous and its interpretation always partly subjective.
Machine learning could be a solution to this dilemma: if a trained model were able to categorize parts of the conversations according to a given codebook with an accuracy similar to that of a human coder, the time-consuming text analysis could be automated.

Machine Learning for Qualitative Content Analysis
Previous studies have shown that supervised machine learning is generally suitable for qualitative content analysis (Crowston et al. 2010, Scharkow 2013). However, these studies used only a few categories that could be distinguished relatively well, e.g. news categories like sports and business. Online counseling, in contrast, is a complex domain. A detailed system of categories is necessary to identify impactful patterns in counseling conversations. Additionally, many categories such as "Empathy" or "Compassion" are quite similar in terms of the words used and can only be distinguished if the model is able to somehow "understand" the meaning of the texts. Recent neural models have drastically outperformed previous approaches for sophisticated problems like sentiment analysis and emotion detection (Howard & Ruder 2018, Devlin et al. 2018, Chatterjee et al. 2019). We wanted to investigate whether these models can be used for qualitative content analysis of online counseling conversations.

Research questions / Contribution
Our first research question is whether it is possible to train a model to identify psycho-social codes with a human-like precision. It also needs to be clarified whether a certain machine learning approach is particularly well suited for certain topics.
It is assumed that this training does not work equally well with all codes of the codebook. Therefore, the second question is which characteristics codes must have in order to be learned particularly well or particularly poorly.
In social science research, the discussion of different assessments of text passages is an important part of the scientific process. Therefore, the analysis of codes incorrectly assigned by a model is an important part of this work. The third research question is, therefore: What differences can be observed between the machine and human coding of text passages? If the deviations are plausible, they can be perceived as enriching the discursive process.

Methodology and Structure of the Paper
For the experimental evaluation, the social scientists in our interdisciplinary team created a codebook consisting of over 50 fine-grained categories and labeled over 10,000 text sequences of psycho-social counseling conversations (described in Section 2). The computer scientists then trained and evaluated a support-vector machine and different state-of-the-art models (e.g. ULMFit and BERT) on the provided data set (Section 3). Finally, the team investigated how human coders from the social sciences perform in comparison to the BERT model on a subset of the data (Section 4).

Creating the Data Set
Online forums for psycho-social counseling provide a good basis for an empirical evaluation because they contain large amounts of publicly accessible data. For our study, we used posts from a German site for parent counseling. Here, parents who have problems in bringing up their children are seeking advice. Possible topics are, for example, drug abuse by the child or inadequate school performance. A user can start a new thread with a problem description. Professional counselors reply and discuss solution approaches with the initial user and others. Thus, each thread contains a series of posts with questions and suggestions about the initially described problem. Since we are especially interested in counseling patterns, we focused on the posts of professional counselors in our analysis.

Development of the Codebook
Based on existing scientific theories on online counseling (Fukkink et al. 2009, Dowling & Rickwood 2014) and first analyses of the text content, a first version of the codebook was created. The various aspects expected in counseling conversations were mapped to a logical hierarchical structure (see Figure 1). The top level covers general counseling aspects, such as "General attitudes" or "Impact factors". On the intermediate level, these aspects were distinguished more finely, e.g. "Help for problem overcoming". The categories at the lowest level are the ones to be used for the annotation of the text passages, such as "Recommendation for action" or "Warning / forecast".
The different codes were defined as precisely as possible and provided with typical examples. The team of coders applied this codebook to the counseling texts in several turns and iteratively improved the codebook. The final version consists of 51 granular categories (see Appendix A).

Data Labeling
Based on the codebook described in Section 2.1, a team of coding social scientists manually labeled over 10,000 text sequences in 336 threads. Such a sequence can consist of only a few words (e.g. a greeting) or of multiple sentences (e.g. a recommended action). Sequences do not overlap, however, i.e. each word should be part of only one labeled sequence. Figure 2 shows an example.
In the end, we obtained a heavily imbalanced data set: The average number of samples per category is about 200, but the numbers vary greatly (see Appendix A for more details). For some categories in the area "Impact factors", e.g. "Evaluation / understanding / calming" or "Experience / explanation / example" we obtained over 1000 samples, whereas other categories including "Change" or "Suggestion to put oneself in a problem situation physically" are barely represented. Such an unequal distribution of the frequencies of single codes is not unusual in the social sciences. Since there is no statistical analysis in qualitative research, this is usually not a problem. There are even some research approaches that consider the analysis of very rare codes, in particular, to be extremely insightful (Glaser 2017).

Data Preparation and Preprocessing
After labeling, we tested the impact of common preprocessing techniques like lemmatization and the removal of usernames. It turned out that both the support-vector machine classifier and the BERT model work best without any of these techniques. Therefore, we used the labeled data without such modifications.
However, the BERT model can only process sequences of at most 512 subword units called WordPiece tokens (Vaswani et al., 2017). Thus, we restricted the sequence length for all training data. We decided to work with a limit of only 256 WordPiece tokens. This value provides a good trade-off between performance and resource consumption in our setting. Longer sequences potentially yield more accurate results but generate a high overhead because all sequences must be padded to the specified length.
Since only a little more than 1% of all data samples contain more than 256 WordPiece tokens, we did not lose much information (cf. Table 1). In return, the shorter sequence length allowed larger batch sizes and faster training.
To make the results of the different classifiers comparable and to take the data set imbalance into account, a stratified 70-30 train-test split was performed on the data set. This results in a training data set with 7169 samples and a test data set with 3072 samples in total. See Appendix A for the number of samples in each category.
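The length restriction described above can be illustrated with a minimal sketch. It assumes that a sequence is already available as a list of token ids and that the padding token has id 0; the helper names `truncate_and_pad` and `share_truncated` are hypothetical, and special tokens such as [CLS] and [SEP] are ignored for brevity.

```python
from typing import List, Tuple

MAX_LEN = 256  # sequence limit chosen in the text
PAD_ID = 0     # assumption: id of the padding token


def truncate_and_pad(token_ids: List[int],
                     max_len: int = MAX_LEN,
                     pad_id: int = PAD_ID) -> Tuple[List[int], List[int]]:
    """Cut a sequence to max_len tokens and pad shorter ones to max_len.

    Returns the token ids and an attention mask (1 = real token, 0 = padding).
    """
    ids = token_ids[:max_len]
    mask = [1] * len(ids)
    n_pad = max_len - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad


def share_truncated(lengths: List[int], max_len: int = MAX_LEN) -> float:
    """Fraction of samples that lose tokens under the given limit."""
    return sum(1 for n in lengths if n > max_len) / len(lengths)


# Hypothetical token ids, for illustration only.
ids, mask = truncate_and_pad([101, 2054, 2003, 102], max_len=6)
print(ids, mask)
```

Every sequence occupies `max_len` positions regardless of its real length, which is why a smaller limit reduces memory use and permits larger batches.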
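As an illustration of what a stratified split does, the following sketch splits the indices of each category separately so that every category keeps roughly the same 70-30 ratio. It is not the code used in the study; the function name, the seed, and the toy labels are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_share=0.3, seed=42):
    """Return (train_idx, test_idx) with about test_share of every label in the test set."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_label):
        idxs = by_label[label]
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_share)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy, imbalanced label list (hypothetical category names).
labels = ["A"] * 100 + ["B"] * 20 + ["C"] * 10
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

In practice, `train_test_split` from scikit-learn offers the same behavior through its `stratify` argument.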

Model-Based Classification of Psycho-Social Text Sequences
As a result of the created codebook and the collected data, our classification task consists of classifying psycho-social text sequences into one of 51 categories. For the training of the classifiers, the data set described in the previous section with 7169 samples is used. The created models are then evaluated against the 3072 samples in our test data set.

Support-Vector Machine as a Baseline
The support-vector machine (SVM) is a commonly used classifier because it is lightweight, trains quickly, and still achieves good results in text classification tasks (Aggarwal, 2018, p. 12). Therefore, the SVM was chosen as a baseline model. The prepared data was transformed into TF-IDF vectors (bag-of-words) for training and evaluation (Aggarwal, 2018, pp. 24-26). The model was implemented using the scikit-learn library. The hyperparameters were chosen according to the results of our hyperparameter tuning. Apart from the default parameters of the TF-IDF vectorizer, a max_df value of 0.5 and a min_df value of 0 were used. Additionally, the inverse-document-frequency reweighting was enabled, and unigrams as well as bigrams were considered. The support-vector classifier itself used a sigmoid kernel with the gamma value set to "scale", a C value of 10, and enabled probability estimates, which internally triggers 5-fold cross-validation.
The SVM achieved a total accuracy of 68.8% on the test data (cf. Table 2). Due to the heavily imbalanced data set, however, the total accuracy is not a good indicator of the model's performance. Thus, we also calculated the macro and weighted F1 scores. The SVM achieves a weighted F1 score of 68.0% (close to the accuracy) and a macro F1 score of 39.7%. The low macro F1 score indicates that classes with little support are frequently misclassified.
A detailed analysis of the results shows that the SVM achieves quite good results in categories with a large number of training samples. For instance, an F1 score of 76.2% is achieved in the category "Experience / explanation / example" with 1398 training and 599 test sequences. Furthermore, simple sequences that contain only a few keywords, such as greeting phrases in the category "Start of conversation", can also be identified quite well, even though only a few training samples exist. In particular, the category "General salutation" achieves an F1 score of 75.0% while having only 22 training and 9 test samples. More complex categories, such as the expression of "Empathy for others", however, achieve a lower F1 score of 59.8% even with a relatively high number of 118 training and 51 test samples. Other categories like "Warning / forecast" achieve an even lower F1 score of only 29.3% despite having 71 training and 30 test samples.
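The difference between the two averages can be made concrete with a small sketch. The macro F1 score is the unweighted mean of the per-class F1 scores, whereas the weighted F1 score weights each class by its number of true samples. The toy labels below are invented for illustration and are not taken from our data set.

```python
from collections import Counter


def f1_per_class(y_true, y_pred):
    """F1 score for every class that occurs in y_true or y_pred."""
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean: every class counts equally, however rare it is."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean weighted by support, i.e. the number of true samples per class."""
    scores = f1_per_class(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[c] * support[c] for c in support) / len(y_true)


# Toy example: the classifier never predicts the rare class.
y_true = ["frequent"] * 8 + ["rare"] * 2
y_pred = ["frequent"] * 10
print(round(macro_f1(y_true, y_pred), 3), round(weighted_f1(y_true, y_pred), 3))
```

A classifier that ignores a rare class loses little weighted F1 but a large share of macro F1, which is the pattern observed for the SVM.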

BERT as Advanced Classifier
BERT is a multi-layer bidirectional Transformer encoder based on the original Transformer implementation described in Vaswani et al. (2017). BERT is typically pre-trained on two unsupervised learning tasks. After the pre-training, the model can be fine-tuned according to the downstream task (Vaswani et al., 2017).
For the classification task in our approach, we used the BertForSequenceClassification implementation from Hugging Face's Transformers library, which combines the BERT Transformer model with a sequence classification head on top (Hugging Face, 2020).
In total, we tested thirteen pre-trained BERT models. Among the ten tested German language models, the results varied between a weighted F1 score of 69.3% and 74.4% on the test data set, whereby the best result was achieved with a pre-trained DBMDZ BERT model. All of the following analyses are, therefore, based on this best-performing DBMDZ BERT model.
The hyperparameters used for the fine-tuning were taken from the original BERT publication (Devlin et al., 2018). Since we are using text sequences with a length of 256 WordPiece tokens, a batch size of no more than 16 was possible due to GPU memory limitations. Larger models, especially multi-lingual models, only allowed a batch size of 8. Further testing has shown that the best results can be achieved with a learning rate of 2e-5 and 4 epochs. Table 2 shows the different evaluation metrics for both the SVM and the best BERT classifier.

Analyzing the Classification Results
The BERT classifier's macro F1 score of 29.2% is lower than the 39.7% of the SVM classifier, which shows that BERT performs considerably worse than the SVM on classes with few samples. Its weighted F1 score of 74.4%, compared to 68.0% for the SVM, indicates, however, that the BERT classifier outperforms the SVM if the whole data set is considered. Table 3 shows an extract from the classification report. In general, the performance of the BERT classifier for a class improves as the number of available training samples for that class increases.
In specific categories, such as "Empathy for others", this observation does not hold. Such categories often contain the previously mentioned category-specific keywords or phrases, which is why the simple bag-of-words approach outperforms the more complex BERT model from a statistical point of view. A detailed analysis of the sequences misclassified by the BERT model, however, has shown that many of these classifications are not inherently wrong but rather assign the sequences to suitable alternative categories. This behavior is examined in greater detail in Section 3.6.

Examining other Classification Models
In addition to BERT, other classification models, such as DistilBERT (Sanh et al., 2019), XLM-RoBERTa, XLM (Lample and Conneau, 2019), and ULMFit (Howard and Ruder, 2018), were examined in our study as well. Table 4 shows the best weighted F1 score of each model. The DistilBERT model performs around 4% worse than the best BERT model on our test data set. This difference is roughly in the range described by the authors of the DistilBERT paper (Sanh et al., 2019). Both the XLM-RoBERTa and the XLM model also perform worse than the best BERT classifier. Apart from the Transformer approaches, the bidirectional RNN model ULMFit was also analyzed. The results show that the different Transformer models as well as the ULMFit model generally perform quite similarly on our classification task, except for the XLM model, which performs even worse than the simple SVM approach.

Explaining the Classifiers
Since the predictions of BERT, and of Transformer models in general, are often opaque and difficult to justify, different approaches, such as LIME (Ribeiro et al., 2016) or Attention Flow (Abnar and Zuidema, 2020), can be used to generate model insights. While LIME takes a retrospective approach that can be applied to any classification model, Attention Flow tries to visualize the actual attention maps of Transformer models. Both approaches provide insights that can be used to explain the classification predictions of the models. Since we want to generate model insights regardless of the approach used to create the model, we chose LIME as our analysis tool.
For example, the analysis of the sentence "Have you ever spoken to the kindergarten teachers?" (cf. original German sentence in Figure 3) helps to further understand the models. The expert coders had coded this sequence as "Follow-up question". The BERT classifier classified it correctly, whereas the SVM classifier assigned it to the category "Questions about possible support resources".
While both assignments might sound reasonable at first, the question arises why each classifier performed its prediction. To answer this question, the text-heatmaps in Figure 3 were generated with LIME. The percentage values indicate how important the LIME model considers the corresponding word for the classification.
The BERT heatmap shows that the model mainly focuses on the words that form the question "Hast", "Du", "mit", "den", "Erzieherinnen" (Engl. "have", "you", "with", "kindergarten teachers") while the SVM heatmap shows that the SVM classifier considers all words as important for the classification but with high focus on the word "Erzieherinnen" (Engl. kindergarten teachers) which is a possible support resource.
This strong focus on individual keywords from the SVM can be explained by the operating principle of the bag-of-words approach and verifies the assumption from Section 3.3 that the SVM performs well in classes with distinctive keywords. But examples like this show that this simple approach can also be misled when such distinctive keywords appear in more complex sequences in which the keyword is not decisive for the correct class and the context has to be considered as well for the correct classification.
Since LIME follows a bag-of-words evaluation model, it cannot provide additional insights into how exactly our BERT model handles context. Thus, we can only use LIME to illustrate whether the models' decisions are reasonable or not.

Analyzing Misclassified Sequences
To better understand our model and to identify further potential for improvement, the incorrectly classified test data were analyzed. Out of the 3072 test sequences, the BERT model classified 2325 sequences correctly. Out of the 747 incorrectly classified sequences, our team of social scientists manually examined a sample of 191 sequences. The inspected samples were randomly chosen based on conspicuous categories that were not in the diagonal of the confusion matrix. The summarized results of this examination are shown in Table 5.
The general conclusion of this analysis is that 58.1% (Table 5, I+II) of the incorrectly classified sequences are not inherently wrong; rather, their assigned category depends on the point of view of the coder. For example, the sequence "Have you ever talked to a pediatrician? Or do you have a family counseling center?" was initially coded as a "Question about possible support resources" by the human coder, whereas the BERT model associated the sequence with a "Follow-up question". In our analysis, the experts concluded that both categories would fit. Another example, in which the predicted label fits even better than the actual label, is the sequence "This has to be done consistently, even if screaming is annoying. You have to go through it - sometime." This sequence was initially coded as a "Warning / forecast" by the human experts. The BERT model, however, assigned this sequence to the category "Recommendation for action". Since these different interpretation options are not only a technical issue but can also be observed in human coders, the intercoder reliability between an expert coder, an untrained human coder ("novice"), and BERT is analyzed in Section 4.
For another 23.6% of the analyzed sequences (Table 5, III+IV), we were able to trace the incorrect classification back to keywords or similar terms shared between different categories. For example, the simple sequence "good luck" is considered to be a "Wish" by the human coders, whereas our BERT model mistakes this sequence for a traditional farewell phrase (category "Other farewell"). This behavior of the BERT model can be explained by the fact that some sequences in the training data contain closing phrases, such as "Good luck [user]".
Table 5: Expert assessment of incorrectly classified text sequences
In 14 more cases (Table 5, V) the experts were unable to identify any distinctive features that caused the sequences to be classified incorrectly by the BERT model.
Apart from these technical insights, in 12 cases (Table 5, VI) weaknesses in the training data set were identified, such as incorrect assignments of the actual label previously made by the human coder, sequences composed by clients rather than counselors, or sequences that only contain single characters.
Furthermore, for a total of nine sequences (Table 5, VII+VIII), the experts declared the sequences "hard to assign for humans" because they use uncommon words, lack context, or consist of multiple sentences belonging to multiple categories.
To estimate the impact of these interpretation options on the evaluation metrics, an adjusted accuracy can be estimated. This adjusted accuracy is calculated by transferring the proportion of analyzed incorrectly classified sequences that are not inherently wrong (Table 5, I+II) to the total of 747 incorrectly classified sequences. This means that 58.1% of the originally incorrectly classified sequences can be considered correct. This increases the number of correctly classified sequences from 2325 to 2759, which corresponds to a more than satisfactory adjusted accuracy of about 90%. Since this is only an overall estimation, adjusted F1 scores cannot be calculated.
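The arithmetic behind this estimate can be retraced with a few lines of code. All numbers are taken from the text above; only the function and variable names are our own choices for this illustration.

```python
def adjusted_accuracy(total, correct, plausible_share):
    """Extrapolate the share of plausible 'errors' to all misclassified sequences."""
    wrong = total - correct
    reassigned = round(wrong * plausible_share)
    adjusted_correct = correct + reassigned
    return wrong, reassigned, adjusted_correct, adjusted_correct / total


wrong, reassigned, adjusted_correct, accuracy = adjusted_accuracy(
    total=3072,             # test sequences
    correct=2325,           # correctly classified by BERT
    plausible_share=0.581,  # Table 5, I+II
)
print(wrong, reassigned, adjusted_correct, round(accuracy, 3))
```

The unadjusted accuracy, for comparison, is 2325 / 3072, i.e. about 75.7%.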

Discussion about Improving the Model
To understand the influence of the availability of training samples, we ran multiple tests in which the number of training samples in a specific category was reduced. We tested all categories that achieve an F1 score of 70% or higher. For each of these categories, six models were trained with a restricted number (10, 20, 50, 100, 250, and 500) of randomly selected training samples. All models were then evaluated on our test data set. The results show that simple categories, such as "General salutation", "Familiar salutation (without name)", "Welcoming", or "Follow-up question", only require about 50 training samples to achieve F1 scores of 0.71 or higher. However, categories that contain text sequences with more complex structures, such as "Experience / Explanation / Example" or "Recommendation for action", still show significant improvements when using 250, 500, or all available text sequences for training.
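The construction of the restricted training sets can be sketched as follows. The sketch assumes that the training data is a list of (text, label) pairs and that only the examined category is subsampled while all other categories stay complete; the helper name `restrict_category` and the toy data are hypothetical.

```python
import random

LIMITS = (10, 20, 50, 100, 250, 500)  # sample limits named in the text


def restrict_category(samples, category, limit, seed=0):
    """Keep at most `limit` random samples of `category`; leave other categories untouched."""
    rng = random.Random(seed)
    in_category = [s for s in samples if s[1] == category]
    others = [s for s in samples if s[1] != category]
    if len(in_category) > limit:
        in_category = rng.sample(in_category, limit)
    return others + in_category


# Toy training data (placeholder texts, for illustration only).
toy_train = ([("some advice", "Recommendation for action")] * 600
             + [("hello", "General salutation")] * 30)

restricted_sets = {
    limit: restrict_category(toy_train, "Recommendation for action", limit)
    for limit in LIMITS
}
for limit, subset in restricted_sets.items():
    print(limit, len(subset))
```

One model would then be trained per restricted set and evaluated on the unchanged test data.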
As described in Section 2, our training data set is unevenly distributed. Data set imbalance is a well-known problem in machine learning (He and Garcia, 2009) and in our case is due to the annotation process. Hereby, available forum posts were annotated without specifically having the category distribution in mind. Typical techniques to reduce the data set imbalance, such as random oversampling or synthetic sampling with data generation (He and Garcia, 2009), cannot easily be applied to textual data, especially not when precise phrasing and wording is important for the classification as in our case. One technique that might, however, lead to improvements is generating new text sequences by randomly combining sentences from other sequences of the same category. Other possible approaches such as aggregating categories with few examples to their superset-level were also considered but dismissed since our goal is to predict categories on a detailed level.
Given the approximate number of samples required per category, we think that manually creating additional training data, especially for underrepresented classes and edge cases, will help to improve the model in the future.
Another idea to improve the model is by taking the model's first and second prediction into account. Human coders can then be supported with suggestions by the model during coding tasks and choose the best fitting label. This feedback can then be used to further improve the model.
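A minimal sketch of such a suggestion step is shown below. It assumes that the classifier returns one probability per category; the probability values are made up for illustration, and the helper names are hypothetical.

```python
def top_k_labels(probabilities, k=2):
    """Return the k most probable (label, probability) pairs, best first."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


def in_top_k(probabilities, true_label, k=2):
    """True if the label chosen by the human coder is among the k suggestions."""
    return true_label in [label for label, _ in top_k_labels(probabilities, k)]


# Made-up model output for one text sequence.
probs = {
    "Follow-up question": 0.48,
    "Question about possible support resources": 0.41,
    "Recommendation for action": 0.11,
}
for label, p in top_k_labels(probs):
    print(f"{label}: {p:.2f}")
```

The label finally chosen by the coder could be stored together with the suggestions and later be used as additional training data.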

BERT vs. Human Coders
The coding of text passages depends to some degree on the subjective perception of the coders. Especially for similar categories like "Empathy" and "Compassion", different coders will sometimes assign different labels to the same text. Thus, even human coders who were trained in the use of the codebook will not reach 100% agreement. To get a better understanding of the applicability of our model for automatic coding, we compared the coding performance of BERT against a trained human coder familiar with the codebook ("expert") and an untrained human coder ("novice").

Intercoder Reliability between Experts
The degree of consensus among coders, the intercoder reliability, is often measured by Cohen's κ (kappa) coefficient (Cohen 1960, Burla et al. 2008). The maximum value of κ is 1; κ > 0.8 indicates almost perfect agreement, and κ > 0.6 indicates substantial agreement.
During the creation of the training data, our experts regularly coded the same texts and aligned their coding style. After coding was finished, we calculated the κ coefficient between those two coders who had coded the most samples. Thereby, we considered only posts coded by both coders and text sequences with at least 75% overlap regarding the first and last word. We determined a κ coefficient of 0.73 between those two experts. This value is relatively high given our complex codebook with over 50 categories.
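For reference, Cohen's κ compares the observed agreement p_o with the agreement p_e expected by chance: κ = (p_o - p_e) / (1 - p_e). The following sketch implements this definition for two coders who labeled the same sequences; the toy codings are invented and do not stem from our data.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two equally long lists of category labels."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("Both coders must label the same non-empty set of items.")
    n = len(coder_a)
    # Observed agreement: share of items with identical labels.
    p_o = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement: product of the coders' label frequencies, summed over labels.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Toy codings of ten sequences ("E" = Empathy, "C" = Compassion).
coder_a = ["E", "E", "C", "C", "E", "C", "E", "C", "E", "E"]
coder_b = ["E", "C", "C", "C", "E", "C", "E", "E", "E", "E"]
print(round(cohens_kappa(coder_a, coder_b), 3))
```

In the toy example, the coders agree on 8 of 10 sequences (p_o = 0.8), while p_e = 0.52, so κ is considerably lower than the raw agreement.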

Intercoder Reliability between an Expert, a Novice, and BERT
To understand how our BERT model performs compared to human coders, we benchmarked the performance of the following three participants: The expert was one of the coders observed in the intercoder reliability measurement. The novice had only a little experience in text annotation and had just recently familiarized herself with the codebook and typical examples for each category. The third participant was our BERT classification model. All participants had the task to annotate the same 50 text passages. Each text passage was randomly chosen from the set of previously unlabeled forum posts.
Besides measuring the intercoder reliability among the participants, we also wanted to generate indications about which sequence length is best suited for the application of the BERT model. For typical coding tasks in the social sciences, the length of a sequence to be coded is defined by a change in the occurring category. This contrasts with most machine-based classifiers which expect a defined sequence of words as input. The choice of start and end for a label in continuous text is usually not part of the classification task.
Therefore, we generated three variants of the 50 text sequences for coding: The first data set consists of single sentences only, the second data set includes, if existing, the following sentence for each sample, and the third data set contains sequences of at most three consecutive sentences. Figure 4 illustrates the breakdown of an exemplary post.
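The three variants can be sketched as windows over the sentences of a post. This is an illustration under assumptions: sentences are split naively at ".", "!" and "?", and one window starts at every sentence. The exact procedure used for the study is the one depicted in Figure 4.

```python
import re


def split_sentences(post):
    """Naive sentence splitter (assumption: sentences end with '.', '!' or '?')."""
    parts = re.split(r"(?<=[.!?])\s+", post.strip())
    return [p for p in parts if p]


def sentence_windows(sentences, size):
    """One sequence per start sentence, extended by up to size - 1 following sentences."""
    return [" ".join(sentences[i:i + size]) for i in range(len(sentences))]


# Invented example post.
post = "Hello Anna. Have you talked to the teachers? That may help. Good luck!"
sentences = split_sentences(post)

variants = {size: sentence_windows(sentences, size) for size in (1, 2, 3)}
for size, sequences in variants.items():
    print(size, sequences)
```

Windows near the end of a post are shorter because no following sentence exists, which matches the "if existing" and "at most three" wording above.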
All three data sets were then coded independently by the participants. As before, the agreement between the different coders was measured using the κ coefficient (see Table 6).
Surprisingly, the intercoder reliability between BERT and the human expert is higher than the intercoder reliability between the expert and the novice, regardless of the sequence length. In its best case, the BERT classifier achieves nearly expert-expert-like intercoder reliability with a value as high as 0.64 in comparison to the earlier calculated expert intercoder reliability of 0.73. It seems that the BERT model has learned the expert style of coding from the training data better than an untrained human coder using the codebook.
Figure 4: Exemplary structure of the sequences within the different data sets
While classifying sequences that contain only one sentence was rated difficult by the human coders due to the missing context, sequences with up to 3 sentences were rated as too long since they often contained patterns from multiple categories. Therefore, sequences with the length of two sentences were rated as best fitting lengths for classifying sequences by both the novice and the expert coder. In contrast to the ratings of the coders, the intercoder reliability shows the highest values when encoding sequences with the length of only one sentence.

Conclusion
It has been shown that machine-based classifiers can reach human-like performance for the annotation of complex categories in psycho-social texts.
The results indicate that the models learn to mimic the coding style of the initial creators of the training data. The trained BERT model was even better in coding than a human novice. As in other areas of machine learning, this bears the risk that a model also learns the bias from the training data. Therefore, it is important to understand and regularly check the decisions of the model by human experts.
High coding quality could not be achieved for all codes, however. Especially underrepresented categories, which are common in social sciences, are problematic. Thus, a sufficient number of training samples is an obvious prerequisite for good results.
The typical approach of social sciences in analyzing text corpora consists of coding one text after the other and ignoring unequal frequencies of the individual codes. Our study shows that when using machine learning methods, it is better to generate training examples for as many categories as possible and pay less attention to the complete coding of individual texts. This is an important finding for the organization of future studies in this field.
The investigation of misclassified sequences showed that many recorded misclassifications actually were minor mistakes. The model frequently chose not the actual but a very similar category, such that even human experts would regard the assignment as plausible. Thus, codes with very similar meanings must be distinguished more sharply to give the model a chance to learn to differentiate.
The analysis of the misclassified sequences of BERT opens up new perspectives for the social sciences: More than half of the "incorrectly classified sequences" appeared to the human expert to be plausible or at least worthy of consideration. Since the discussion of the understanding of individual text passages is an important element of social science research, such plausible misinterpretations can enrich the research process. They offer an alternative way of looking at reality and force human coders to either rethink their assessments or to better justify them.
Currently, we are working on improving the classification performance. One approach is the generation of additional training data for underrepresented categories. Another idea is using an ensemble of SVM and BERT as a classifier to better utilize the individual strengths of the different models. In any case, the findings on how the models work and perform help to consider such technical aspects in future social science research.
With regard to the application domain, we can conclude that it is definitely possible to analyze online counseling conversations with the help of machine learning. We intend to use machine learning in future research projects to investigate correlations between the different techniques used by counselors and the characteristics and reactions of clients. In addition to the question of whether successful counselors use certain techniques significantly more often than others, it can now be clarified if certain approaches are particularly promising for certain target groups or specific problems. These findings can be integrated into the education of online counselors. Furthermore, assistance systems are conceivable that support online counselors in real-time with information generated from this data.
In any case, the results of this study have shown that it is possible to merge the advantages of qualitative and quantitative approaches in social science with the help of machine learning. Automated data annotation for qualitative analysis is the cornerstone for future insights on an unprecedented level.