Text Zoning and Classification for Job Advertisements in German, French and English

We present experiments to structure job ads into text zones and to classify them into professions, industries and management functions, thereby facilitating social science analyses of labor market demand. Our main contributions are empirical findings on the benefits of contextualized embeddings and on the potential of multi-task models for this purpose. With contextualized in-domain embeddings in BiLSTM-CRF models, we reach an accuracy of 91% for token-level text zoning and outperform previous approaches. A multi-tasking BERT model performs well for our classification tasks. We further compare transfer approaches for our multilingual data.


Introduction
Text mining on job advertisements has become important for analyzing labor market demand, since job ads provide unique job-level data on employers' staff needs (Atalay et al., 2020; Das et al., 2020; Calanca et al., 2019; Dawson et al., 2019). Our proposed techniques will be useful to study how job tasks and skill demand have developed in Switzerland in different labor market segments over the last decades. We present preparatory work for precise skill and task extraction: First, we structure job ads into text zones, that is, text parts dedicated to particular topics. Second, we classify job ads into professions, industries and management function. By replacing human annotation with scalable NLP, more fine-grained analyses on big data will become feasible.
Job ads contain information on topics such as the company, the job, or required qualifications. For an accurate extraction of skills and tasks, we need to identify the corresponding text zones, as many key terms are ambiguous, for instance 'dynamic' might refer to a personality trait or to a dynamic CRM system. Information on different topics can be densely packed in sentences, thus it seems most reasonable to formalize text zoning for job ads as token-level sequence labeling. In addition to this structuring of job ads, we need automatic classifications of job ads to enable detailed analysis, most importantly into professions, but also into industries and management functions.
For our Swiss data in German, French, English and Italian we need multilingual approaches. Most (labeled) data, however, is in German. To avoid sparse data problems for the minority languages, we thus experiment with transfer approaches.
The empirical experiments presented a) investigate the benefit of contextualized embeddings for text zoning, b) compare multilingual modeling with machine-translation-based approaches, and c) explore the potential of multi-task models for sequence labeling and text classification.

Gnehm (2018) achieved an accuracy of 89.8% for the text zoning task at hand, namely token-level sequence labeling of job ads into eight zones, using BiLSTMs, task-specific word embeddings and ensembling. Hermes and Schandock (2017) segment on paragraph level, distinguish four classes, and reach an accuracy of 97% with KNN in a multi-label classification. Grüger and Schneider (2019) extract HTML lists from IT job ads; distinguishing four list classes, they reach an accuracy of 95% with a LinearSVC. These two less fine-grained approaches are not directly comparable to ours.

Related Work
Classification of professions is often provided by companies, and their methods and performance are not reported (Burke et al., 2020; Das et al., 2020; Calanca et al., 2019). Atalay et al. (2020) use embedding similarity measures to match jobs to 110 classes, and reach an accuracy of 53%.

Experimental Data
We use two job ad data sets, differing in size and data collection method. Each has its own advantages for the experiments here and for future analyses.
The Swiss Job Market Monitor (SJMM) corpus consists of 80,000 job ads in German, French, English and Italian, from yearly samples representative of the Swiss job market, going back to 1950. High-quality human annotations of profession, industry, and management function are available for all job ads. Text zones are annotated on German job ads until 2014. The SJMM hence provides us with labeled data for supervised machine learning experiments, and will allow analyses of how job tasks and skills have developed over the last decades.
The Online Ads (OA) corpus contains 9 million ads in German, French, English and Italian from job portals and company websites in Switzerland, crawled since 2012 by a private company. This big data set is valuable for building in-domain embeddings for our experiments, and makes fine-grained analyses feasible for future research.
Text zoning in the SJMM is operationalized as introduced in Gnehm (2018). Eight zones are distinguished based on their content, and the text is segmented on token level. Token level seems most appropriate, as information on different zones can be densely packed in single sentences. Not every ad contains information on every zone (e.g. not every job ad specifies personality traits of the ideal candidate), and the zone distribution is strongly skewed: The job description (z6) comprises the largest share of tokens (more than 30%), whereas the least frequent zone, reason of the vacancy (z2), accounts for only 0.5% of tokens. Tokens show high zone ambiguity, with more than 90% of tokens occurring in more than one zone.
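The ambiguity figure above is a simple corpus count; as an illustration, it can be computed in one pass over token–zone pairs. The sketch below uses toy data, not our corpus reader:

```python
from collections import defaultdict

def zone_ambiguity(token_zone_pairs):
    """Fraction of token types that occur in more than one text zone."""
    zones_per_token = defaultdict(set)
    for token, zone in token_zone_pairs:
        zones_per_token[token.lower()].add(zone)
    ambiguous = sum(1 for zones in zones_per_token.values() if len(zones) > 1)
    return ambiguous / len(zones_per_token)

# Toy example: 'dynamic' occurs both in the job description zone (z6)
# and the personality zone (z8), so 1 of 3 token types is ambiguous.
pairs = [("dynamic", "z8"), ("dynamic", "z6"), ("CRM", "z6"), ("apply", "z3")]
ratio = zone_ambiguity(pairs)
```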
In text zoning experiments, we use the data split of Gnehm (2018) for comparability: Aiming for a model optimized for future application, the dev and test set (test set A) are each restricted to 10% (n=650) of the most recent available data (2010–2014); the remaining 80% plus all data further back to 1970 (n=22,700) serve as training data. In pure text classification experiments, we can use all multilingual SJMM data from the time span of interest (1990–2018, n=34,600).
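A chronological hold-out of this kind can be sketched as follows. This is a simplification (the actual split additionally restricts dev and test to the 2010–2014 stratum), and the `year` field is illustrative:

```python
def chronological_split(ads, dev_frac=0.1, test_frac=0.1):
    """Hold out the most recent ads for dev and test, train on the older rest."""
    ads = sorted(ads, key=lambda ad: ad["year"])
    n = len(ads)
    n_dev, n_test = int(n * dev_frac), int(n * test_frac)
    train = ads[: n - n_dev - n_test]
    dev = ads[n - n_dev - n_test : n - n_test]
    test = ads[n - n_test :]
    return train, dev, test

# Toy corpus of 100 ads spread over the years 1970-2014
ads = [{"id": i, "year": 1970 + i % 45} for i in range(100)]
train, dev, test = chronological_split(ads)
```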

Text Representation
We experiment with different text embeddings: Static type-level fastText (FT) embeddings provide a single vector for all occurrences of a word (Bojanowski et al., 2017). Contextualized embeddings can represent different word senses by capturing the semantics of the surrounding text. We contrast BERT sub-word embeddings (Devlin et al., 2019) with character-based FLAIR embeddings (Akbik et al., 2018).
Given the large amount of in-domain text, we train FLAIR embeddings on both of our corpora (FLAIRSJMM, FLAIROA). We systematically compare the effect of these in-domain embeddings vs. the general-domain embeddings mentioned above.
Qualitative evidence for the usefulness of contextualized in-domain embeddings for text zoning is provided in Figure 1 with a UMAP (McInnes et al., 2018) visualization of the semantic space. The term 'Ansprechpartner' (contact person) in the job description zone (z6) indicates that part of the job is to serve as contact person, probably for clients or co-workers; the same term in the zone for the wanted personality (z8) hints that this person should be approachable or trustworthy. In the residual text (z3), however, it simply refers to contact information for the application procedure. The separation of the respective vectors in Figure 1 shows that such zone-specific meanings can be recognized with our contextualized in-domain embeddings.

Table 1: Accuracy of text zoning on test set A for different in-domain (d) and general (g) embeddings. The results with standard deviation report averages of 3 runs (of 5 runs for the baseline by Gnehm (2018)), and column Ens. reports their majority vote ensemble (Rokach, 2010).
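The ensemble column in Table 1 is a per-token majority vote (Rokach, 2010) over several runs. A minimal stdlib sketch with hypothetical label sequences:

```python
from collections import Counter

def majority_vote(predictions):
    """Per-token majority vote over several models' label sequences.

    predictions: list of label sequences, one per model, all the same length.
    Counter.most_common keeps first-encountered order on ties (Python 3.7+),
    so the first model's label wins a tie.
    """
    ensemble = []
    for labels in zip(*predictions):
        top_label, _ = Counter(labels).most_common(1)[0]
        ensemble.append(top_label)
    return ensemble

run1 = ["z6", "z6", "z8", "z3"]
run2 = ["z6", "z8", "z8", "z3"]
run3 = ["z6", "z6", "z2", "z3"]
ensemble = majority_vote([run1, run2, run3])  # ['z6', 'z6', 'z8', 'z3']
```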

Text Zoning and Joint Classification
In this section, we first assess different text representations for our text zoning task and evaluate the benefits of contextualized embeddings compared to previous work (Gnehm, 2018). We then explore the potential of joint classification, that is, including the classification of job ads into professions, industries and management function in the sequence labeling text zoning task. Such a multi-tasking model would be most convenient in practical application. We furthermore assume that all these tasks are somewhat related and that simultaneous learning could be beneficial.
For all these experiments we use the sequence labeling architecture proposed by Huang et al. (2015), a bidirectional LSTM with a CRF layer, implemented in the flair NLP library (Akbik et al., 2018). Model selection is based on dev set accuracy, and we evaluate on the test set. For selected models we repeat the experiment three times and report mean performance and standard deviation (Reimers and Gurevych, 2017).

Text Zoning: For the first series of experiments, the results in Table 1 show that models featuring contextualized FLAIR embeddings outperform all others in token-level sequence labeling of text zones. The best setting combines in-domain FLAIR embeddings with general-domain FT word embeddings, reaching an ensemble accuracy of 0.91 and improving the baseline of Gnehm (2018) by more than 1 percentage point (see Appendix A.4 for per-class results). This corresponds to earlier findings for PoS tagging and NER by Akbik et al. (2018), who hypothesize that type-level embeddings capture semantics that is complementary to the character-level features of FLAIR.
Interestingly, the FLAIROA embeddings built from the much larger online corpus are less useful than FLAIRSJMM, probably because the SJMM text zoning data consist mostly of job ads from print media.
The lower performance of the pretrained German BERT might be explained by subtokenization issues: the many compound nouns and abbreviations of our special domain seem to make it hard to build meaningful units to calculate embeddings over. Using the mean of all sub-token embeddings for a token does not resolve this, but an improvement can be observed if we fine-tune the embeddings to the task. We also tried ensemble combinations of models with different input embeddings (not shown) and of models with three runs (see Table 1). The best ensembles reach an accuracy of 0.91, indicating limited variance between models. The lack of performance increase is convenient, as running a single classifier is easier than applying ensembles.

Joint classification: In this second series of experiments, we investigate whether it is beneficial to combine the sequence labeling text zoning task with the classification of industry (11 classes), profession (34 classes) and management function (2 classes) in a single model. To this end, we add three special class tokens and their labels at the end of each job ad text. We focus on the best FLAIRSJMM+FT embeddings for text zoning, and assess adding model capacity (layers, hidden states). To direct the model towards learning predictions for the three special tokens, we experimentally increase their class weights w ∈ {10, 50, 100} in the loss function. For technical reasons, this is only applicable to models without a CRF layer, hence we also assess the effect of the CRF. The different joint models in Table 2 show relatively stable results for text zoning, industry and management function classification, whereas for the more fine-grained profession classification, accuracy depends on the model specifics. Dropping the CRF strongly affects accuracy for industry, profession and management function, which shows the interdependence of the three variables represented as neighboring tokens.
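The joint input described above, a zone-labeled token sequence with three special class tokens appended at the end, amounts to a simple data transformation. A sketch with illustrative token names (the actual special tokens are not specified here):

```python
def to_joint_sequence(tokens, zone_labels, industry, profession, management):
    """Append one special token per classification task to a zoning example,
    so a single sequence labeler predicts zones and document classes together.
    Token names <IND>/<PROF>/<MGMT> are hypothetical placeholders."""
    specials = [("<IND>", industry), ("<PROF>", profession), ("<MGMT>", management)]
    joint_tokens = tokens + [tok for tok, _ in specials]
    joint_labels = zone_labels + [label for _, label in specials]
    return joint_tokens, joint_labels

toks, labs = to_joint_sequence(
    ["Wir", "suchen", "Pflegefachperson"],
    ["z1", "z6", "z6"],
    industry="health", profession="nursing", management="none",
)
```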
Large class weights w ∈ {50, 100} compensate for this performance drop and tune the model to the fine-grained classification tasks. More capacity in the form of larger hidden sizes and additional layers is useful, although the second layer helps only in combination with other factors. 9 By adding model capacity and a loss weight of 50 for the classification tasks, we find the model that performs best for profession classification, with relatively good results for all other tasks.
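The effect of the per-class weight w on the loss can be illustrated with a weighted negative log-likelihood over token predictions. The probabilities below are toy values, and the real model uses the library's weighted cross-entropy rather than this hand-rolled version:

```python
import math

def weighted_nll(gold_labels, predicted_probs, class_weights):
    """Sum of -w(gold) * log p(gold). Giving the rare special-token classes a
    large weight steers the model toward the classification tasks."""
    loss = 0.0
    for gold, probs in zip(gold_labels, predicted_probs):
        weight = class_weights.get(gold, 1.0)  # zones keep weight 1
        loss += -weight * math.log(probs[gold])
    return loss

weights = {"profession": 50.0}  # w in {10, 50, 100} in our experiments
gold = ["z6", "profession"]
probs = [{"z6": 0.9}, {"profession": 0.9}]
# The same prediction error costs 50x more on the profession token.
loss = weighted_nll(gold, probs, weights)
```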

Text Classification
Results in Section 4.2 suggest that simultaneous learning of profession, industry and management function classification might be beneficial, but not enough model capacity is devoted to these tasks when they are included in sequence labeling. Therefore, we experiment in the following with multi-tasking text classification for these three tasks. In monolingual experiments, we assess different multi-tasking models for the classification of profession, industry and management function, and benchmark them against the respective single-task models. Finally, we conduct multilingual experiments for profession classification. The SJMM data set is multilingual, but most labeled data (75%) is in German. Hence we test different transfer approaches to avoid sparse data problems.
With the text classification implementation of Flair (Akbik et al., 2018), we obtain document-level representations for job ads by feeding FLAIR or FT embeddings into an RNN. For BERT embeddings, we take the topmost layer of the transformer model and fine-tune the embeddings during training; document embeddings are extracted from the '[CLS]' token. In both cases, the actual class labels are calculated by a linear layer on top.

Monolingual Experiments: We compare single- vs. multi-tasking classification models using the best embeddings from the previous experiments. In multi-tasking, we simultaneously predict profession (34 classes), industry (11 classes) and management function (2 classes). We feed each job ad into the data once per task, adding each time a special token that specifies the task to learn.

9 See the ablation study in Table 11.

Table 3: Accuracy for profession (34 classes), industry (11 classes) and management function (2 classes) in single-task (sT) and multi-task (mT) classification on test set B.
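Feeding each job ad once per task, with a special token specifying the task, is a small dataset expansion. A sketch with hypothetical token names (whether the token is prefixed or appended is an implementation detail):

```python
TASKS = ["profession", "industry", "management"]

def expand_for_multitask(ads):
    """Turn each (text, labels-per-task) ad into one example per task,
    prefixing a special token that tells the classifier which task to learn."""
    examples = []
    for text, labels in ads:
        for task in TASKS:
            examples.append((f"<{task.upper()}> {text}", labels[task]))
    return examples

ads = [("Wir suchen eine Pflegefachperson",
        {"profession": "nursing", "industry": "health", "management": "none"})]
examples = expand_for_multitask(ads)  # 1 ad -> 3 task-specific examples
```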
With text classifiers and BERT embeddings, we reach an accuracy of 0.778 for professions (see Table 3). Although test sets A and B are not directly comparable, this surpasses the sequence labeling results. For the other, somewhat less important variables, accuracy here is slightly lower. BERT outperforms our domain-specific contextualized embeddings in both multi- and single-task classification, probably because the BERT embeddings are fine-tuned to the task during training. Multi-tasking does not seriously alter profession classification, and the multi-tasking BERT reaches similar accuracy for industry and management function as the single-task classifiers. It is thus reasonable to go for the BERT multi-tasking classifier.

A detailed error analysis for professions further strengthens trust in the model. First, prediction probabilities and errors are strongly correlated: while for p ≥ 0.9 the error rate is only 12%, for p ≤ 0.5 it is 75%. Probabilities are thus useful for error detection. Second, a human post-evaluation of a random sample of 20 errors with p ≥ 0.9 showed that only 10% of these errors are considered hard errors. In 90% of the cases, several class labels can be seen as correct options, and the model prediction is appropriate. This underlines that our model copes well with a sometimes ambiguous classification task.

Multilingual Experiments: On a classification task for 11 professions, we compare two approaches. 12 First, we use machine translation (MT) (DeepL) to translate French and English job ads to German, and apply a classifier trained for German. Within this approach, we further test whether familiarizing the classification model during training with partially awkward wording ('Translationese') helps, by including automatic translations in our train (and dev) set. 13 Second, we train multilingual classifiers on our German, French and English data with general-domain, multilingual FLAIR and BERT embeddings.
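The probability-based error analysis above amounts to bucketing predictions by confidence and measuring error rates per bucket. A stdlib sketch with toy prediction triples (not our actual model outputs):

```python
def error_rate_by_confidence(predictions, threshold, above=True):
    """Error rate among predictions with probability >= threshold
    (or <= threshold if above=False).

    predictions: list of (predicted_label, gold_label, probability) triples.
    Returns None if no prediction falls in the bucket.
    """
    if above:
        selected = [p for p in predictions if p[2] >= threshold]
    else:
        selected = [p for p in predictions if p[2] <= threshold]
    if not selected:
        return None
    errors = sum(1 for pred, gold, _ in selected if pred != gold)
    return errors / len(selected)

preds = [("nursing", "nursing", 0.95), ("sales", "marketing", 0.95),
         ("IT", "finance", 0.40), ("IT", "IT", 0.45)]
high = error_rate_by_confidence(preds, 0.9)               # confident predictions
low = error_rate_by_confidence(preds, 0.5, above=False)   # uncertain predictions
```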
In the MT approach, accuracy decreases strongly: by around 10 percentage points for French, and by up to 20 for English (see Table 4). One reason for the stronger effect in English is that its class distribution differs from German. 14 Adding translated ads indeed helps, and raises accuracy by 9 points for English (BERT) and French (FLAIRSJMM+FT). Why the French results vary more with FLAIRSJMM+FT and the English results more with BERT needs further investigation.
Best performing are the multilingual BERT for French (0.744) and BERT with Translationese for English (0.693). Multilingual models are a convenient solution, because no MT is needed at application time. For the MT approach, including translated ads in training seems necessary, especially if class distributions differ between languages. Either way, due to being fine-tuned to the task, BERT outperforms our domain-specific FLAIR embeddings. 15

12 For the sake of sound evaluation, we choose a broader classification scheme here and restrict the experiments to French and English (the amount of Italian ads is too small).
13 We add 4,100 (500) translated ads from French and 2,900 (350) from English to the original 20,700 (2,600) German ads in the train (dev) set.
14 See Table 13 in the Appendix.
15 Without fine-tuning, the multilingual BERT reaches accuracies below 0.3 for all three languages.

Conclusion
Contextualized embeddings facilitate precise information extraction. Our best single text zoning model outperforms the ensemble approach of Gnehm (2018) and reduces the relative error rate by 12%. The combination of sequence labeling for text zoning and text classification for professions, industries and management function in a single multi-task model did not lead to entirely satisfying results. However, we found a multi-tasking BERT text classifier that performs well and provides a convenient solution for structuring our corpus into professions, industries and management function. The error analysis for profession classification raised trust in this model, and the model's classification probabilities provide valuable information for post-validation and subsequent analyses. Multilingual experiments showed that our classifiers are affected by MT. Utilizing translated material in training, or alternatively multilingual models, are potential strategies, but the question of the best transfer approach for our multilingual data needs further investigation.
The most promising approach for future work seems to be training our own domain-specific BERT embeddings, both for optimizing classification and for the intended subsequent skill and task extraction. This way, we can also exploit the large amount of data in the OA corpus. Another direction worth exploring is multi-tasking, be it by including more variables or by experimenting with more sophisticated approaches (Clark et al., 2019; Liu et al., 2019b).

A.2 Preprocessing & Training Parameters
Training parameters are set according to recommendations in the Flair library (Akbik et al., 2018) unless reported differently here.
Text representations: For FLAIRSJMM and FLAIROA we train forward and backward language models with LSTMs with one layer and 2048 hidden states on the SJMM (67MB) and the OA (4GB) corpus. Preprocessing is kept simple: We map digits to 0, white space to single blanks, and replace web and e-mail addresses with special tokens (replaced-dns, replaced-email, replaced-url). We build our own domain-specific character dictionary, setting the rarest 0.0001% of characters to unknown.
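The preprocessing steps above can be sketched with a few regular expressions. The patterns here are simplified approximations of what we use, not our exact pipeline:

```python
import re

def preprocess(text):
    """Simple corpus normalization for language-model training: e-mails, URLs
    and bare domains become special tokens, digits become 0, and whitespace
    is collapsed to single blanks. Patterns are illustrative, not exhaustive."""
    text = re.sub(r"\S+@\S+\.\S+", "replaced-email", text)
    text = re.sub(r"https?://\S+|www\.\S+", "replaced-url", text)
    text = re.sub(r"\b[\w-]+\.(?:ch|com|de|org)\b", "replaced-dns", text)
    text = re.sub(r"\d", "0", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

normalized = preprocess("Bewerben Sie sich  bis 31.12. auf www.example.ch/jobs")
# -> "Bewerben Sie sich bis 00.00. auf replaced-url"
```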
We optimize with SGD, clip gradients at 0.25 and set the dropout probability to 0.25. Sequence length is set to 250 and batch size to 100. We train our language models with a learning rate of 20 for 2 weeks, reaching validation perplexities of 1.73 (forward) and 1.74 (backward) for FLAIRSJMM, and 1.45 (forward) and 1.46 (backward) for FLAIROA.
General-domain FLAIR embeddings are provided by Akbik et al. (2018): for German we use embeddings pretrained on a mixed corpus (Web, Wikipedia, Subtitles), and in the multilingual setting embeddings pretrained on the JW300 corpus. For German BERT embeddings, we use the model trained by Deepset.ai with 12 layers, 768 hidden states, 12 heads and 110M parameters; for multilingual BERT embeddings, a model with the same configuration, trained on cased text in 104 languages.
FT denotes the German fastText embeddings (without character features) provided in the Flair library.

Sequence labeling: We optimize with SGD, clipping gradients at 5. Minibatch size is 32; training starts with a learning rate of 0.1, which is annealed by a factor of 0.5 after 5 evaluation periods without loss decrease. We stop training after 150 epochs, or as soon as the learning rate drops to ≤ 0.0001. We use variational dropout (p = 0.5) and word dropout (p = 0.05) for regularization.

Text classification: For all models with FLAIR embeddings (FLAIRSJMM, FLAIRSJMM+FT, multilingual FLAIR), training parameters are as described above for sequence labeling. Classifiers with German or multilingual BERT embeddings are optimized with Adam over 5 epochs, with a learning rate of 3e-5, in minibatches of 16.
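The anneal-on-plateau schedule used for sequence labeling can be sketched as follows. This is a simplified stand-in for the library's scheduler, not its actual implementation:

```python
class AnnealOnPlateau:
    """Halve the learning rate after `patience` evaluations without a new best
    loss; training stops once the rate falls to min_lr or below."""

    def __init__(self, lr=0.1, factor=0.5, patience=5, min_lr=0.0001):
        self.lr, self.factor, self.patience, self.min_lr = lr, factor, patience, min_lr
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, loss):
        if loss < self.best:
            self.best = loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr

    @property
    def should_stop(self):
        return self.lr <= self.min_lr

sched = AnnealOnPlateau()
for loss in [1.0, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9]:
    sched.step(loss)
# After 5 evaluations without improvement, the rate is halved to 0.05.
```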