Competence-Level Prediction and Resume&Job Description Matching Using Context-Aware Transformer Models

This paper presents a comprehensive study on resume classification to significantly reduce the time and labor needed to screen an overwhelming number of applications while improving the selection of suitable candidates. A total of 6,492 resumes are extracted from 24,933 job applications for 252 positions designated into four levels of experience for Clinical Research Coordinators (CRC). Each resume is manually annotated with its most appropriate CRC position by experts through several rounds of triple annotation to establish guidelines. As a result, a high Kappa score of 61% is achieved for inter-annotator agreement. Given this dataset, novel transformer-based classification models are developed for two tasks: the first task takes a resume and classifies it to a CRC level (T1), and the second task takes both a resume and a job description and predicts whether the application is suited to the job (T2). Our best models, using section encoding and multi-head attention decoding, achieve accuracies of 73.3% on T1 and 79.2% on T2. Our analysis shows that prediction errors are mostly made between adjacent CRC levels, which even experts find hard to distinguish, implying the practical value of our models in real HR platforms.


Introduction
An ongoing challenge for Human Resources (HR) is the process used to screen and match applicants to a target job description, with the goal of minimizing recruiting time while maximizing proper matches. The use of generic job descriptions not clearly stratified by level of competence or skill sets often leads candidates to apply to every possible job, wasting both recruiters' and applicants' time. A more challenging aspect is the evaluation of unstructured data such as resumes and CVs, which represents about 80% of the data processed daily, a task that is typically not an employer's priority given the manual effort involved (Stewart, 2019).
The current practice for screening applications involves reviewing individual resumes via traditional approaches that rely on string/regex matching. The scope of posted job positions varies by hiring organization type, job level, focus area, and more. Recent advances in Natural Language Processing (NLP) enable the large-scale analysis of resumes (Deng et al., 2018; Myers, 2019). NLP models also allow for comprehensive analysis of resumes and identification of latent concepts that may easily go unnoticed in a typical manual process. A model's ability to infer core skills and qualifications from resumes can be used to normalize relevant content into standard concepts for matching against stated position requirements (Chifu et al., 2017; Valdez-Almada et al., 2018). However, the task of resume classification has been under-explored due to the lack of resources for individual research labs and the heterogeneous nature of job solicitations.
This paper presents new research that aims to help applicants identify the level of job(s) they are qualified for and to provide recruiters with a rapid way to filter and match the best applicants. For this study, resumes submitted to four levels of Clinical Research Coordinator (CRC) positions are used. To the best of our knowledge, this is the first time that resume classification is explored with levels of competence rather than categories. The contributions of this work are summarized as follows: • To create a high-quality dataset that comprises 3,425 resumes annotated with 5 labels based on real CRC positions (Section 3).
• To present a novel transformer-based classification approach using section encoding and multi-head attention decoding (Section 4).
• To develop robust NLP models for the tasks of competence-level classification and resume-to-job_description matching (Section 5).

CRC1
Manage administrative activities associated with the conduct of clinical trials. Maintain data pertaining to research projects, complete source documents/case report forms, and perform data entry. Assist with participant scheduling.

CRC2
Manage research project databases and development study related documents, and complete source documents and case report forms. Interface with research participants and study sponsors, determine eligibility, and consent study participants according to protocol.

CRC3
Independently manage key aspects of a large clinical trial or all aspects of one or more small trials or research projects. Train and provide guidance to less experienced staff, interface with research participants, and resolve issues related to study protocols. Interact with study sponsors, monitor/report SAEs, and resolve study queries. Provide leadership in determining, recommending, and implementing improvements to policies and procedures.

CRC4
Function as a team lead to recruit, orient, and supervise research staff. Independently manage the most complex research administration activities associated with the conduct of clinical trials. Determine effective strategies for promoting/recruiting research participants and retaining participants in long-term clinical trials.

Prior studies in this area have focused on classifying resumes or job descriptions into occupational categories (e.g., data scientist, healthcare provider). However, no work has yet distinguished resumes by level of competence. Furthermore, we believe that our work is the first to analyze resumes together with job descriptions to determine whether or not applicants are suitable for particular jobs, which can significantly reduce the intensive daily labor performed by HR recruiters.

Data Collection
Between April 2018 and May 2019, the department of Human Resources (HR) at Emory University received about 25K applications, including resumes in free text, for 225 Clinical Research Coordinator (CRC) positions. A CRC is a clinical research professional whose role is integral to initiating and managing clinical research studies. There are four levels of CRC positions, CRC1-4, with CRC4 requiring the most expertise. Table 1 gives the descriptions of these four CRC levels. Table 2 shows the statistics of the collected applications and resumes. Out of the 24,933 applications, 89% are for the entry-level positions, CRC1-2, which is expected since CRC3-4 positions require more qualifications (A). At any time, various positions are posted for the same level from different divisions (e.g., cardiology, renal, infectious disease). Thus, it is common to see resumes from the same applicant applying to several job postings within the same CRC level.
After removing duplicated resumes within the same level, 9,286 resumes remain, discarding 63% of the original applications (B). It is common to see the same applicant applying to positions across multiple levels. After removing duplicated resumes across all levels and retaining only the resumes to the highest level (e.g., if a person applied for both CRC1 and CRC2, retain the resume for only CRC2), 6,492 resumes are preserved, discarding an additional 11% of the original applications (C). For our research, we carefully select 3,425 resumes from C by discarding ones that are not clearly structured (e.g., no section titles) or contain too many characters that cannot be easily converted into text, while keeping similar ratios across the CRC levels (C_r). We also create a set similar to B, say B_r, that retains only the resumes in C_r. C_r and B_r are used for our first task (§4.1) and second task (§4.2), respectively.

Preprocessing
The resumes collected by HR come in several formats (e.g., DOC, DOCX, PDF, RTF). All resumes are first converted into the unstructured text format, TXT, using publicly available tools. They are then processed by our custom regular expressions designed to segment the different sections of each resume. As a result, every resume is segmented into six sections: Profile, Education, Work Experience, Activities, Skills, and Others. Table 3 shows the ratio of resumes in each level including those sections.
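The section segmentation above can be sketched as follows. The actual regular expressions used in this work are not reproduced here, so the patterns below are illustrative assumptions, as is the helper name `segment_resume`:

```python
import re

# Hypothetical title patterns; the real segmenter likely covers many
# more title variants than shown here.
SECTION_PATTERNS = {
    "Education": re.compile(r"^\s*education\b", re.I),
    "Work Experience": re.compile(r"^\s*(work|professional)?\s*experience\b", re.I),
    "Activities": re.compile(r"^\s*activities\b", re.I),
    "Skills": re.compile(r"^\s*skills\b", re.I),
}

def segment_resume(text: str) -> dict:
    """Split raw resume text into named sections. Lines before the first
    recognized title are assigned to Profile; the title lines themselves
    are consumed as section boundaries."""
    sections = {"Profile": []}
    current = "Profile"
    for line in text.splitlines():
        for name, pattern in SECTION_PATTERNS.items():
            if pattern.match(line):
                current = name
                sections.setdefault(current, [])
                break
        else:
            sections[current].append(line)
    return {name: "\n".join(lines).strip() for name, lines in sections.items()}
```

A resume whose titles are not matched by any pattern would remain entirely in Profile, which mirrors why unclearly structured resumes were discarded from C.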

Annotation
Two experts with experience in recruiting applicants for CRC positions of all levels design the annotation guidelines over 5 rounds by labeling each resume with either one of the four CRC levels, CRC1-4, or Not Qualified (NQ), indicating that the applicant is not qualified for any CRC level. Thus, a total of 5 labels are used for this annotation. For each round, 50 resumes are randomly selected from C_r in Table 2, keeping similar ratios of the CRC levels as C_r, and labeled by the two experts, with the guidelines subsequently revised based on their agreement. Another batch of 50 resumes is then selected for the next round and annotated based on the revised guidelines. For rounds 2-5, a third person (non-expert) is added and instructed to follow the guidelines developed in prior rounds; thus, annotation is completed by three people for these rounds. Table 4 shows the Fleiss Kappa scores estimating inter-annotator agreement (ITA) for each round with respect to the five competence levels. For R1, with no guidelines designed, poor ITA is observed with a kappa score of 16.1%. The ITA gradually improves over the rounds and reaches a kappa score of 60.8% among 3 annotators, indicating the high quality of annotation in our dataset. The following gives a brief summary of the guideline revisions after each round:

Round 1 (1) Clarify qualified and not-qualified applicants, (2) Define transferable skills (e.g., general research experience vs. experience in healthcare), (3) Define clinical settings, clinical experience, and clinical research experience, (4) Set requirements by levels of academic preparation.
Round 2 (1) Revise the length of clinical experience based on levels of academic preparation and whether the degree is in a scientific/health-related field or a non-scientific/non-health-related field, (2) Refine CRC2-4 degree requirements, years of clinical research, and clinical experience requirements.

Round 4 (1) Remove clinical experience requirements from CRC2-4 and require a minimum of 1 year of clinical research for those with a scientific vs. non-scientific degree, (2) Revisit laboratory scientist requirements, (3) Do not count academic experience as a research assistant unless it involved over 1,000 hours. Rationale: participation by semester is typically data entry or participation in a component of the research, not full engagement in a project.
Round 5 Increase the number of years required for a bench/laboratory researcher.
During these five rounds, 250 resumes are triple annotated and adjudicated. Given the established annotation guidelines,1 an additional 3,175 resumes are single annotated and sample-checked. Thus, a total of 3,425 resumes are annotated for this study.
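The inter-annotator agreement statistic reported in Table 4 is Fleiss' kappa; a minimal sketch of its computation, with `fleiss_kappa` as a hypothetical helper name, is:

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each item being the list of
    category labels assigned by the annotators (same count per item)."""
    n = len(ratings[0])  # annotators per item
    totals = Counter()   # marginal counts per category
    # mean per-item agreement P_bar
    P_bar = 0.0
    for row in ratings:
        counts = Counter(row)
        totals.update(counts)
        P_bar += (sum(v * v for v in counts.values()) - n) / (n * (n - 1))
    P_bar /= len(ratings)
    # chance agreement P_e from marginal category proportions
    total = sum(totals.values())
    P_e = sum((c / total) ** 2 for c in totals.values())
    return (P_bar - P_e) / (1 - P_e)
```

Perfect agreement on all items yields a kappa of 1.0, while the 60.8% reached in round 5 corresponds to "substantial" agreement under common interpretations of the scale.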

Approach
This section introduces transformer-based neural approaches to address the following two tasks:
T1 Given a resume, decide which level of CRC position the corresponding applicant is suitable for (Section 4.1).
T2 Given a resume and a CRC job description, decide whether or not the applicant is suitable for that particular job (Section 4.2).
T1 is a multiclass classification task where the labels are the five CRC levels including NQ (Table 4). This task is useful for applicants who may not have a clear idea of which levels they are eligible for, and for recruiters who want to match applicants to the most suitable jobs available to them. T2 is a binary classification task such that, even with the same resume, the label can be either positive (accept) or negative (reject) depending on the job description. This task is useful for applicants who have a good idea of which CRC levels they fit into but want to determine which particular jobs they should apply to, as well as for recruiters who need to quickly screen applicants for interviews.

Competence-Level Classification
For the competence-level classification task (T1), a baseline model that treats the whole resume as one document (§4.1.1) is compared to context-aware models using section pruning (§4.1.2), chunk segmenting (§4.1.3), and section encoding (§4.1.4).

1 The annotation guidelines are available at our project page.

Whole-Context: Section Trimming
Let N be the maximum number of input tokens that a transformer encoder can accept, and let S_i be the i'th of the m sections in a resume R. Then n_i, the maximum number of tokens of S_i allowed as input, is allocated in proportion to the section length:

n_i = ⌊(N − 1) · |S_i| / Σ_{j=1}^{m} |S_j|⌋

All trimmed sections are appended in order to the special token c, representing the entire resume, which creates the input list I = {c} ⊕ S_1 ⊕ · · · ⊕ S_m. I is fed into the transformer encoder (TE) that generates the list of embeddings {e_c, e_{r_11}, ..., e_{r_mn_m}}, where {e_{r_i1}, ..., e_{r_in_i}} is the embedding list of S_i and e_c is the embedding of c. Finally, e_c is fed into the linear decoder (LD_t) that generates the output vector o_t ∈ R^d to classify R into one of the competence levels (in our case, d = 5).
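The proportional trimming budget can be sketched as follows; the floor-division allocation is an assumption consistent with the total budget of N − 1 tokens (one slot reserved for c), and `trim_sections` is a hypothetical helper name:

```python
def trim_sections(sections, N):
    """Trim each section so that the special token c plus all trimmed
    sections fit within the encoder limit N. Each section's budget n_i
    is proportional to its length."""
    total = sum(len(s) for s in sections)
    budgets = [(N - 1) * len(s) // total for s in sections]
    return [s[:n] for s, n in zip(sections, budgets)]
```

Flooring each budget guarantees the combined input never exceeds N, at the cost of occasionally leaving a few token slots unused.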

Context-Aware: Section Pruning
Section trimming in Section 4.1.1 allows the whole-context model to take part of every section as input. However, it is still limited because not all features necessary for classification are guaranteed to lie in the trimmed range. Moreover, this model makes no distinction between contents from different sections once S_1..m are concatenated. This section proposes a context-aware model that overcomes those two issues by pruning tokens more intelligently and encoding each section separately, so that the model learns weights for individual sections and makes more informed predictions. Figure 2 shows an overview of the context-aware model using section pruning. Given the maximum number of tokens N that the transformer encoder (TE) allows, any section S_i ∈ R that contains more than N tokens is pruned by applying the following procedure: 1. If |S_i| > N, remove all stop words in S_i.
2. If still |S i | > N , remove all words whose document frequencies are among the top 5%.
3. If still |S i | > N , remove all words whose document frequencies are among the top 30%.
Then, the pruned section S'_i is created for every S_i, where S'_i ⊆ S_i and |S'_i| ≤ N. Each S'_i is prepended with the special token c_i representing that section and fed into the transformer encoder (TE), which generates the list {e_{c_i}, e_{r_i1}, ..., e_{r_iN}}, where e_{c_i} is the embedding of c_i, called the section embedding, and the rest are the embeddings of S'_i. Let e_c^Σ = Σ_{i=1}^{m} e_{c_i} be the sum of all section embeddings, representing the whole resume. Finally, e_c^Σ is fed into the linear decoder (LD_p) that generates the output vector o_p ∈ R^d to classify R into a competence level.
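The three-step pruning procedure can be sketched as follows. The stop-word list and the document-frequency ranking are stand-ins (the paper does not specify which stop-word list is used), and `prune_section` is a hypothetical helper name:

```python
# Illustrative stop-word list; a real implementation would use a
# standard list such as NLTK's.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "for", "with"}

def prune_section(tokens, doc_freq, N):
    """Apply the pruning steps in order, stopping as soon as the section
    fits the encoder limit N: (1) drop stop words, (2) drop words in the
    top 5% of document frequency, (3) drop words in the top 30%."""
    if len(tokens) > N:
        tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    ranked = sorted(doc_freq, key=doc_freq.get, reverse=True)
    for pct in (0.05, 0.30):
        if len(tokens) <= N:
            break
        frequent = set(ranked[:max(1, int(len(ranked) * pct))])
        tokens = [t for t in tokens if t.lower() not in frequent]
    return tokens
```

Because highly frequent words carry little discriminative signal across resumes, removing them first tends to preserve the section's distinctive content.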

Context-Aware: Chunk Segmenting
Section pruning in §4.1.2 preserves more relevant information than section trimming in §4.1.1; however, the model still cannot see the entire resume.
Thus, this section proposes another method that uniformly segments the resume into multiple chunks and encodes each chunk separately. Figure 3 shows the context-aware model using chunk segmenting. Let S_i = {S_i.1, ..., S_i.k} be the i'th section in R, where S_i.j is the j'th chunk in S_i and k = ⌈|S_i| / L⌉ given the maximum length L of any chunk, so that |S_i.j| = L for all j < k and |S_i.k| ≤ L.2 Each chunk S_i.j is prepended with the special token c_i.j representing that chunk and fed into TE, which generates the embedding list E_i.j = {e_{c_i.j}, e_{r_i.j1}, ..., e_{r_i.jL}}. Let e_c^Σ = Σ_{∀i,∀j} e_{c_i.j}. Finally, e_c^Σ is fed into LD_s that generates the output vector o_s ∈ R^d to classify R.
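The uniform segmentation above amounts to slicing each section into ⌈|S_i|/L⌉ chunks; a minimal sketch (with `segment_chunks` as a hypothetical helper name):

```python
import math

def segment_chunks(section_tokens, L):
    """Uniformly split a section into k = ceil(|S_i| / L) chunks, all of
    exactly length L except possibly the last one."""
    k = math.ceil(len(section_tokens) / L)
    return [section_tokens[j * L:(j + 1) * L] for j in range(k)]
```

Unlike trimming or pruning, every token of the resume survives this step; the cost is that each chunk is encoded independently by TE.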

Context-Aware: Section Encoding
Chunk segmenting in §4.1.3 allows the model to see the entire resume; however, it loses information about which sections the chunks belong to. This section proposes a method to distinctively encode chunks from different sections, which can be applied to both the section pruning (§4.1.2) and chunk segmenting (§4.1.3) models. Figures 2 and 3 illustrate this extension: a section index embedding e^I_i is added to every chunk (or section) embedding from the i'th section before the embeddings are summed into e_{c+I}^Σ. Finally, e_{c+I}^Σ is fed into LD_se, which creates o_pe ∈ R^d and o_se ∈ R^d for the section pruning and chunk segmenting models, respectively.
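The section encoding step can be sketched numerically as follows; the random matrix stands in for trainable section index embeddings, the dimensions are illustrative, and `encode_with_sections` is a hypothetical helper name:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8   # embedding dimension (illustrative)
m = 6   # number of sections (Profile, Education, ..., Others)

# One section index embedding e^I_i per section; random stand-ins for
# what would be trainable parameters in the real model.
section_index_emb = rng.normal(size=(m, d))

def encode_with_sections(chunk_embs, section_ids):
    """Add the section index embedding e^I_i to each chunk embedding
    (one row per chunk) and sum the rows, producing the resume
    representation e^{c+I}_Sigma fed to the decoder."""
    augmented = chunk_embs + section_index_emb[section_ids]
    return augmented.sum(axis=0)
```

Adding e^I_i lets two identical chunks from different sections (e.g., the same phrase in Education vs. Work Experience) map to different vectors, restoring the section identity that plain chunk segmenting discards.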

Resume-to-Job_Description Matching
For the resume-to-job_description matching task (T2), the whole-context model is adapted to establish the baseline (§4.2.1) and compared to context-aware models using chunk segmenting + section encoding coupled with the job description embedding (§4.2.2), as well as multi-head attention between the resume and the job description (§4.2.3).

Whole-Context: Sec./Desc. Trimming
The whole-context model is similar to the one using section trimming in §4.1.1, with additional input from the job description, illustrated as the dotted boxes in Figure 1. Let B = {b_1, ..., b_|B|} be the job description, where b_i is the i'th token in B. Given the maximum number of tokens N that a transformer encoder can accept, n_i and n_b, the maximum numbers of tokens allowed as input from S_i and B respectively, are allocated proportionally as in §4.1.1. Then, the input list I = {c} ⊕ S_1 ⊕ · · · ⊕ S_m ⊕ B is created and fed into TE, which generates the corresponding embedding list. Finally, e_c is fed into LD that generates o_t ∈ R^2 to make the binary decision of whether or not R is suitable for B.

Context-Aware: Description Embedding
The job description B is fed into TE to generate its embedding e_{c_b}. Then, the job description embedding e_{c_b} is concatenated with the section-encoded resume embedding e_{c+I}^Σ (§4.1.4) and fed into LD_be, which generates o_be ∈ R^2. Figure 4 depicts an overview of the context-aware model using the techniques in §4.2.2 empowered by multi-head attention (Vaswani et al., 2017) between the resume R and the job description B, which allows the model to learn correlations between individual tokens in R and B, r_* and b_*, as well as the chunk and job description representations, c_*.

Context-Aware: Multi-Head Attention
Let E_r ∈ R^{γ×λ} be the matrix representing R, where γ is the total number of chunks across all sections in R, λ = L + 1, and L is the maximum length of any chunk. Each row of E_r is thus the embedding list E_i.j ∈ R^{1×λ} of the corresponding chunk S_i.j. Let E_b ∈ R^{γ×ν} be the matrix representing B, where ν = N + 1 and N is the maximum length of B. Each row of E_b is a copy of the embedding list E_b ∈ R^{1×ν} from §4.2.2; thus, every row of E_b is identical. These two matrices, E_r and E_b, are fed into two types of multi-head attention (MHA) layers, one finding correlations from R to B (R2B) and the other from B to R (B2R), which generate two attention matrices, A_r2b ∈ R^{γ×λ} and A_b2r ∈ R^{γ×ν}.
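A minimal numerical sketch of one such cross-attention direction is shown below. Learned projection matrices and layer normalization are omitted for brevity, so this is an illustration of the attention mechanism rather than the exact layer used in the paper; `cross_attention` is a hypothetical helper name:

```python
import numpy as np

def cross_attention(Q_src, K_tgt, V_tgt, n_heads=2):
    """Multi-head scaled dot-product cross-attention: each row of Q_src
    attends over K_tgt/V_tgt. R2B would use resume chunks as queries
    over the job description; B2R swaps the roles."""
    d = Q_src.shape[-1]
    assert d % n_heads == 0
    h = d // n_heads
    outs = []
    for i in range(n_heads):
        q = Q_src[:, i * h:(i + 1) * h]
        k = K_tgt[:, i * h:(i + 1) * h]
        v = V_tgt[:, i * h:(i + 1) * h]
        scores = q @ k.T / np.sqrt(h)
        # numerically stable softmax over the target positions
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        outs.append(weights @ v)
    return np.concatenate(outs, axis=-1)
```

Running the layer in both directions yields the two attention outputs (R2B and B2R) that are combined with the chunk embeddings downstream.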
Experiments

Data Split
Table 5 shows the data split used to develop models for the competence-level classification task (T1). The annotated data in the row C_r of Table 2 are split into training (TRN), development (DEV), and test (TST) sets with ratios of 75:10:15. 70% of the data are annotated with the entry levels, CRC1 and CRC2, which is not surprising since 77.3% of the applications are submitted for those 2 levels. The ratio of CRC4 is notably lower than the ratio of applications submitted to that level, 6.8%, implying that applicants tend to apply to jobs for which they are not qualified. 13.9% of the applicants are NQ; thus, if our model detects even that portion robustly, it can remarkably reduce human labor.

Table 6 shows the data split used for the resume-to-job_description matching task (T2). The same ratios of 75:10:15 are applied to generate the TRN:DEV:TST sets, respectively. Note that an applicant can submit resumes to more than one CRC level. Algorithm 1 is designed to avoid any overlapping applicants across datasets while keeping similar label distributions (Appendix A.1).

Table 6: Data statistics for the resume-to-job_description matching task (T2) in Section 4.2. Y/N: applicants whose applied CRC levels match/do not match our annotated label, respectively.

Data Distributions
Out of the 5,362 applications, 40.5% match our annotation of the CRC levels, indicating that fewer than half of the applications are suitable for the positions they apply to. The number of matches drops significantly for CRC2: only 14.5% are found to be suitable according to our labels. Too few instances are found for CRC4; only 4.3% of the applicants applying for this level match our annotation.

Models
For our experiments, the BERT base model is used as the transformer encoder (Devlin et al., 2019), although our approach is not restricted to any particular type of encoder. Models are developed for T1 using the whole-context baseline, section pruning, chunk segmenting, and section encoding from Section 4.1, and for T2 using their counterparts from Section 4.2. The P⊕I⊕J model adapts section pruning to generate e_{c+I}^Σ instead of chunk segmenting in §4.2.2. For the P⊕I⊕J⊕A model, the attention matrices in §4.2.3 are reconfigured as A_r2b, A_b2r ∈ R^{m×ν} (m: the number of sections in R). These models are developed to compare the two approaches for T2. Also, the ⊖E models exclude the embedding list E_c, such that f_i.j is redefined as f_i.j = e^I_i + a^r2b_i.j + a^b2r_i.j in §4.2.3, to estimate the pure impact of multi-head attention.

Results
Labeling accuracy is used as the evaluation metric for all our experiments. Each model is trained three times, and the average score as well as the standard deviation are reported.3 Table 7 shows the results for T1 achieved by the models in Section 5.2. All context-aware models, even without section encoding, perform significantly better than the baseline model (W_r): by 1.5% with section pruning (P) and 3.3% with chunk segmenting (C). C shows a 1.8% greater improvement than P, implying that the additional context used in C is essential for this task. Section encoding (⊕I) helps both P and C. As a result, C⊕I shows a 4.2% improvement over W_r and also gives the least variance of 0.16.

Table 7: Accuracy (± standard deviation) on the development (DEV) and test (TST) sets for T1, achieved by the models in Section 5.2. δ: delta over W_r on TST.

Table 8 shows the results for T2 achieved by the models in Section 5.2. Neither the context-aware model using section pruning (P) nor the one using chunk segmenting (C) with section encoding (⊕I) performs better than the baseline model (W_r+b) by simply concatenating the job description embedding (⊕J). Indeed, none of the P⊕* models performs better than W_r+b, which is surprising given their success on T1 (Table 7). However, C with multi-head attention (C⊕I⊕J⊕A) shows a significant improvement of 4.6% over its counterpart, which is very encouraging.

Table 8: Accuracy (± standard deviation) on the development (DEV) and test (TST) sets for T2, achieved by the models in Section 5.2. δ: delta over W_r on TST.
Multi-head attention (⊕A) also gives a good improvement to P. Interestingly, the model excluding the embedding list (⊖E) performs slightly better than the one including it (P⊕I⊕J⊕A), implying that the embeddings from the pruned sections are not as useful once the attention is in place. Figure 5 shows the confusion matrix for T1's best model, C⊕I. The prediction of CRC1, which has the largest number of training instances (Table 5), shows robust performance, whereas the other labels are mostly confused with their neighbors, which are often hard to distinguish even for human experts.

3 Appendix A.2 provides details of our experimental settings for the replicability of this work.

Error Analysis
This section provides a detailed analysis from our experts about prediction errors made by our best model in Section 5.3.

General
The following observations are found as general error cases:
• Overrating foreign-trained MDs and persons with PhDs who have no clinical research experience: (1) the model picks up research projects done in training as significant research; (2) it is unable to identify clinical research experience.
• Misclassifying laboratory personnel entering the CRC area.
• Miscounting research experience: (1) identifying the dates of experience; (2) implications for creating a structured entry form versus a resume; (3) academic research experiences of less than 1,000 hours, such as a single-semester experience, are not counted; (4) paid research experience needs to be counted.
• Not picking up research-related titles or terms.

Conclusion
This paper proposes two novel tasks, competence-level classification (T1) and resume-to-job_description matching (T2), and provides a high-quality dataset as well as robust models using several transformer-based approaches. The accuracies achieved by our best models, 73.3% for T1 and 79.2% for T2, show good promise for these models to be deployed in real HR systems. To the best of our knowledge, this is the first time that these two tasks are thoroughly studied, especially with the latest transformer architectures. We will continue to improve these models by integrating experts' knowledge.

A Appendices
A.1 Splitting Algorithm for T2
Algorithm 1 splits the TRN/DEV/TST sets for T2 (Table 6) without overlapping applicants across them while keeping the label distributions. The key idea is to split the data by the targeted label distributions but with a smaller training set ratio than the original one. If there are overlapping applicants, all of the overlaps are put into the training set, so that the training set ratio grows large enough to approach the targeted ratio while the label distributions are still kept to a great extent.

A.2 Experimental Settings
Table 9 shows the hyper-parameters used for each model (Section 5.2). For chunk segmenting in Section 4.1.3, let k_i be the number of chunks in the i'th section; then K = Σ_{i=1}^{m} k_i is the total number of chunks in R. To utilize the GPU memory wisely, resumes with the same K are put into the same batch, and different batches are trained with different batch sizes based on K and the GPU memory, to maximize GPU usage. Different seeds are used for each of the three development runs.
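The core constraint of Algorithm 1, that no applicant may span two splits, can be sketched as follows. This is a simplified illustration rather than the paper's exact algorithm: it assigns each applicant's applications wholesale to whichever split is furthest below its quota, whereas Algorithm 1 additionally biases overlaps into the training set; `split_no_overlap` is a hypothetical helper name:

```python
import random

def split_no_overlap(applications, ratios=(0.75, 0.10, 0.15), seed=0):
    """Group applications by applicant and assign each applicant's
    applications as a unit to TRN/DEV/TST, so no applicant appears in
    more than one split. `applications` is a list of
    (applicant_id, label) pairs."""
    by_applicant = {}
    for applicant_id, label in applications:
        by_applicant.setdefault(applicant_id, []).append((applicant_id, label))
    ids = sorted(by_applicant)
    random.Random(seed).shuffle(ids)
    total = len(applications)
    splits = [[], [], []]
    quotas = [r * total for r in ratios]
    for aid in ids:
        # place the whole applicant into the split with most remaining quota
        k = max(range(3), key=lambda i: quotas[i] - len(splits[i]))
        splits[k].extend(by_applicant[aid])
    return splits
```

Because applicants are assigned atomically, the realized split ratios only approximate 75:10:15, which is exactly why the paper starts from a smaller training ratio and lets overlaps fill the remainder.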

A.3 Analysis on Section Pruning
Section pruning is used to discard insignificant tokens in order to meet the input-size limit required by the transformer encoder (Section 4.1.2). Tables 10 and 11 show the section lengths before and after section pruning, respectively. These tables show that section pruning noticeably reduces the maximum and average lengths of the sections.

Table 11: Section lengths after section pruning (Section 4.1.2). Average/Max: the average and max lengths of input sections. Ratio: the ratio of input sections that are under the max input length restricted by the transformer encoder.