Hiring Now: A Skill-Aware Multi-Attention Model for Job Posting Generation

Writing a good job posting is a critical step in the recruiting process, but the task is often more difficult than many people think: it is challenging to specify the required level of education, experience, and relevant skills in line with the company information and job description. To this end, we propose a novel task of Job Posting Generation (JPG), which we cast as a conditional text generation problem that generates job requirements according to job descriptions. To tackle this task, we devise a data-driven global Skill-Aware Multi-Attention generation model, named SAMA. Specifically, to model the complex mapping relationships between input and output, we design a hierarchical decoder: we first label the job description with multiple skills, then generate the complete text guided by these skill labels. Meanwhile, to exploit prior knowledge about skills, we construct a skill knowledge graph that captures the global prior knowledge of skills and refines the generated results. The proposed approach is evaluated on real-world job posting data, and experimental results clearly demonstrate the effectiveness of the proposed method.


Introduction
Writing high-quality job postings is the crucial first step in attracting and filtering the right talent in the recruiting process of human resource management. Given a job description and basic company information, the key to a job posting is writing the job requirements, which requires specifying professional skills properly. Either too many or too few requirements may negatively impact talent recruiting. Because of the extremely large number of job positions and the variety of professional skills, many companies must invest heavily in this step to win the war for talent.
To this end, we propose the task of Job Posting Generation (JPG) in this paper, and we cast it as a novel conditional text generation task that generates the job requirement paragraph. Exploiting the ubiquitous job posting data, we aim to automatically specify the level of necessary skills and generate fluent job requirements in a data-driven manner, as shown in Figure 1.
Although the JPG task is of great significance, its complexity poses several key challenges: 1) Generating job requirements requires not only producing overall fluent text but also precisely organizing key content such as skills and other information, which is very difficult for current neural systems; in particular, long-text to long-text generation easily leads to information missing (Shen et al., 2019). 2) The key points of job descriptions and the skills of job requirements form complex many-to-many relations, which makes the mapping very difficult to learn. 3) Exploiting the global information in the heterogeneous relations between basic company information and professional skills across the whole dataset is of great importance for generating high-quality job requirements.
To address these challenges, we focus on the richness and accuracy of skills in generated job requirements and propose a global Skill-Aware Multi-Attention (SAMA) model for the JPG task. Specifically, we devise a two-pass decoder to generate informative, accurate, and fluent job requirement paragraphs. The first-pass decoder predicts multiple skills according to the job description, which is a multi-label classification task (Zhang and Zhou, 2014). The second-pass decoder generates the complete text according to the predicted skill labels and the input text. Moreover, we build a skill knowledge graph to capture the global information in the whole job posting dataset in addition to the local information provided by the input. Through the skill knowledge graph, our model obtains global prior knowledge that alleviates the misuse of skills. Extensive experiments on real-world job posting data demonstrate the effectiveness of the proposed method.
The main contributions of this paper can be summarized as follows:
• We propose the novel task of job posting generation, defined as conditionally generating a job requirement given a job description and basic company information.
• A data-driven generation approach SAMA is proposed to model the complex mapping relationships and generate informative and accurate job requirements.
• We build a real-world job posting dataset and conduct extensive experiments to validate the effectiveness and superiority of our proposed approach.

Data Description
We collect a job posting dataset from a famous Chinese online recruiting market, covering a period of 19 months from 2019 to 2020. There are 107,616 job postings in total. After removing repetitive and overly short job postings, 11,221 records are selected. The dataset spans 6 different industry domains. Detailed statistics of the dataset are given in Table 1.
Considering the importance of skills for JPG, we select 2,000 records and manually tag the skills in them. We then train a word-level LSTM-CRF model (Huang et al., 2015) to recognize the skills in the whole dataset.
We also keep the basic information, i.e., job position and company scale information, for the reason that they are the critical attributes of job postings that have impacts on the level of skills.
In order to capture the global prior knowledge of skills, we construct a skill knowledge graph according to the semantic relations of entities in the job postings. As shown in Figure 2, there are three types of entities, i.e., skill, company scale, and job position. The skill entities are further divided into two types: generic skills (denoted by G) and professional skills (denoted by P).

Approach
Formally, given a job posting dataset D = {(X_i, Y_i, B_i)}_{i=1}^N, X_i = (x_{i,1}, x_{i,2}, ..., x_{i,m}) is the word sequence of a job description paragraph, Y_i = (y_{i,1}, y_{i,2}, ..., y_{i,n}) is the word sequence of the corresponding job requirement paragraph, and B_i = (b^p_i, b^s_i) is the basic information, where b^p and b^s are the job position and company scale information, N is the size of the dataset, and m and n are the lengths of the sequences X_i and Y_i, respectively. The target of the JPG task is to estimate P(Y_i | X_i, B_i), the conditional probability of a job requirement given the job description and basic information.
Firstly, considering the importance of skill prediction in JPG, we decompose the probability P(Y_i | X_i, B_i) into a two-stage generation process of skill prediction and job requirement paragraph generation:

P(Y_i | X_i, B_i) = P(S_i | X_i, B_i) · P(Y_i | X_i, S_i, B_i),   (1)

where S_i = (s_{i,1}, s_{i,2}, ..., s_{i,l}) is the skill word sequence of the corresponding job requirement and l is the length of S_i. (The details of how skills are extracted are described in Section 2.) Since S_i and B_i are conditionally independent given X_i, we can derive that

P(Y_i | X_i, B_i) = P(S_i | X_i) · P(Y_i | X_i, S_i, B_i).   (2)

Secondly, to refine the skills, we leverage global prior information through the skill knowledge graph G_s = (E_1, R, E_2), where E_1 and E_2 are the sets of head and tail entities and R is the set of relations. Given the basic information B_i and the skill knowledge graph G_s, we obtain a set of skills O_i = f(B_i, G_s), where f is an invertible query function, which ensures a one-to-one mapping between B_i and O_i.
Thirdly, to fuse the local and global information, the probability P(Y_i | X_i, S_i, B_i) during text generation is calculated as

P(Y_i | X_i, S_i, B_i) = λ · P_global + (1 − λ) · P_local,   (3)

where λ is a hyperparameter that adjusts the balance between the two probabilities.
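To make the fusion concrete, the following minimal sketch (not the paper's implementation; toy distributions) mixes a local and a global probability distribution over a shared vocabulary with the weight λ:

```python
# Sketch of Equation (3): a convex combination of the global and local
# next-word distributions, controlled by the hyperparameter lambda.
def mix_distributions(p_local, p_global, lam=0.5):
    """Return lam * P_global + (1 - lam) * P_local, renormalized defensively."""
    mixed = [lam * pg + (1.0 - lam) * pl for pl, pg in zip(p_local, p_global)]
    z = sum(mixed)
    return [x / z for x in mixed]

# Toy 3-word vocabulary: the mixture rebalances the two views of the skills.
p = mix_distributions([0.7, 0.2, 0.1], [0.1, 0.6, 0.3], lam=0.5)
```

With λ = 0.5 the two views contribute equally; the paper later reports that performance peaks at an intermediate λ, consistent with this balance.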

Job Description Encoder
The input job description word sequence X_i is first transformed into a sequence of word embeddings. To obtain long-term dependency representations, we use a bi-directional LSTM (Schuster and Paliwal, 1997) as the text encoder. The input sequence is transformed into a hidden state sequence H = (h_1, h_2, ..., h_m) by concatenating the forward and backward hidden states. Specifically, the initial encoder hidden state h_0 is a zero vector, and the last encoder hidden state h_m is used to initialize the skill decoder.
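The bi-directional encoding can be sketched as follows. This is an illustrative toy, with a plain tanh RNN cell standing in for the LSTM and randomly initialized weights; only the forward/backward concatenation logic mirrors the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d_emb, d_hid, m = 4, 3, 5            # embedding size, hidden size, seq length
W_in = rng.normal(size=(d_hid, d_emb))
W_rec = rng.normal(size=(d_hid, d_hid))

def run_rnn(embs):
    h = np.zeros(d_hid)              # h_0 is a zero vector, as in the paper
    states = []
    for e in embs:
        h = np.tanh(W_in @ e + W_rec @ h)
        states.append(h)
    return states

X = [rng.normal(size=d_emb) for _ in range(m)]
fwd = run_rnn(X)                     # forward pass over x_1 .. x_m
bwd = run_rnn(X[::-1])[::-1]         # backward pass, realigned to positions
# Each h_t concatenates the forward and backward states at position t.
H = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
h_m = H[-1]                          # used to initialize the skill decoder
```

Note that each concatenated state has dimension 2 × d_hid, which is why later weight shapes in the paper involve both d and 2d terms.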

Skill Prediction
Intuitively, the process of skill prediction is a Multi-Label Classification (MLC) task, which aims to assign multiple skills to each job description. To capture the correlations between skills, inspired by Yang et al. (2018), we view this MLC task as a sequence generation problem.
Formally, the skill decoder first takes the last hidden state h_m of the encoder as input, then derives a context vector C_st via an attention mechanism (Luong et al., 2015) to help predict the skill labels:

α_t = softmax(g^T W_1 h_t),  C_st = Σ_t α_t h_t,

where W_1 ∈ R^{d×d} is a trainable weight matrix and d is the hidden vector size. Inspired by Yuan et al. (2018), the job description is labelled with multiple skills by generating a skill sequence that joins the skills with the delimiter <SEP> and contains an unfixed number of skills (e.g., English <SEP> computer science <SEP> c++). The skill decoder is based on an LSTM whose hidden vector is computed by g_t = LSTM(e_{t−1}, g_{t−1}), where e_{t−1} is the embedding of the previously generated skill word. The last skill decoder hidden state g_l is used to initialize the text decoder. The skill sequence is finally obtained by a softmax classification over the skill vocabulary V_skill. In detail, a non-linear transformation is applied to form the skill decoder semantic representation I_st, and the probability P(S_i | X_i, B_i) is computed via

I_st = tanh(W_2 [g_t; C_st]),
P(s_t | s_{<t}, X_i) = softmax(W_3 I_st + b_3),

where [;] denotes vector concatenation and W_2 ∈ R^{d×2d}, W_3 ∈ R^{|V_skill|×d}, and b_3 ∈ R^{|V_skill|} are parameters.
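The attention and classification step above can be sketched numerically. This is a shape-level illustration with random weights; the bilinear score g^T W_1 h follows Luong-style attention, which is an assumption about the exact score function:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                  # numerical stability
    e = np.exp(z)
    return e / e.sum()

d, m, V = 4, 3, 6                    # hidden size, input length, |V_skill|
rng = np.random.default_rng(1)
H = rng.normal(size=(m, d))          # encoder hidden states h_1 .. h_m
g = rng.normal(size=d)               # current skill-decoder hidden state
W1 = rng.normal(size=(d, d))
W2 = rng.normal(size=(d, 2 * d))
W3 = rng.normal(size=(V, d))
b3 = np.zeros(V)

alpha = softmax(np.array([g @ W1 @ h for h in H]))   # attention weights
C_st = alpha @ H                                     # context vector
I_st = np.tanh(W2 @ np.concatenate([g, C_st]))       # semantic representation
p_skill = softmax(W3 @ I_st + b3)                    # distribution over V_skill
```

The shapes match the paper's stated parameter sizes: W_2 maps the 2d-dimensional concatenation [g; C_st] back to d, and W_3 projects onto the skill vocabulary.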

Skill Refinement
The skill prediction process considers only local information, which results in some misuse of skills. To refine the skills in the generated job requirement, global information is taken into account through the skill knowledge graph. The skill entities are divided into G and P as described in Section 2. Here, the basic assumption is that a generic skill appears more frequently than a professional skill among all job postings, because a professional skill carries more domain-specific characteristics. We use a hyperparameter θ as a threshold to divide the skill entities.
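The frequency-based split can be sketched as follows; the skill names and counts are toy examples, and θ = 100 matches the value reported later in the network configuration:

```python
from collections import Counter

# Sketch: split skill entities into generic (G) vs. professional (P)
# by corpus frequency against the threshold theta.
def split_skills(skill_occurrences, theta=100):
    counts = Counter(skill_occurrences)
    generic = {s for s, c in counts.items() if c >= theta}
    professional = {s for s, c in counts.items() if c < theta}
    return generic, professional

# Toy corpus: common skills appear often, domain skills rarely.
occurrences = ["english"] * 150 + ["teamwork"] * 120 + ["verilog"] * 8
G, P = split_skills(occurrences, theta=100)
```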
Given the basic information B_i = (b^p_i, b^s_i), the set of skills O_i is obtained from the skill knowledge graph by the query function f. In detail, we first obtain the set of entities that have the "N.T.M." relation with b^p_i and the set of entities that have the "IN" relation with b^s_i. Second, we take the intersection of these two sets. Finally, we keep the entities whose type is P. We embed O_i as S'_i = (s'_{i,1}, s'_{i,2}, ..., s'_{i,k}) and linearly combine it into a skill graph context vector C_nd via an attention mechanism:

β_j = softmax(g_t^T W_4 s'_{i,j}),  C_nd = Σ_j β_j s'_{i,j},

where W_4 ∈ R^{d×d'} are parameters and d' is the dimension of the word embeddings. Then a non-linear transformation is applied to form the graph skill semantic representation I_nd, and the probability P_global is computed via

I_nd = tanh(W_5 [g_t; C_rd; C_nd]),
P_global = softmax(W_6 I_nd + b_6),

where g_t and C_rd will be introduced in the next section, and W_5 ∈ R^{d×(2d+d')}, W_6 ∈ R^{|V_skill|×d}, and b_6 ∈ R^{|V_skill|} are trainable parameters.
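The query function f can be sketched over a toy graph. The relation names "N.T.M." and "IN" follow the paper; the (head, relation, tail) triple layout and the entity names are illustrative assumptions:

```python
# Sketch of the query function f over the skill knowledge graph.
def query_skills(triples, professional, b_position, b_scale):
    # Step 1a: skills that "need to master" (N.T.M.) the given position.
    need_to_master = {h for h, r, t in triples
                      if r == "N.T.M." and t == b_position}
    # Step 1b: skills that appear "IN" companies of the given scale.
    in_scale = {h for h, r, t in triples
                if r == "IN" and t == b_scale}
    # Steps 2-3: intersect the two sets, then keep only type-P skills.
    return need_to_master & in_scale & professional

triples = [
    ("cad", "N.T.M.", "interior designer"),
    ("english", "N.T.M.", "interior designer"),
    ("cad", "IN", "100-499"),
    ("english", "IN", "100-499"),
]
O = query_skills(triples, professional={"cad"},
                 b_position="interior designer", b_scale="100-499")
```

Here "english" is filtered out because it is a generic (type G) skill, even though it satisfies both relations.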

Job Requirement Generation
Job requirement generation fuses multiple attention mechanisms over three sources: the job description, the predicted skills, and the skills from the skill knowledge graph. The text decoder, based on another LSTM, generates the final word sequence. The hidden vector of the text decoder is computed by g_t = LSTM(e_{t−1}, g_{t−1}), where e_{t−1} is the word embedding of the target word generated at time step t − 1. After obtaining g_t, a non-linear transformation is applied to form the text decoder semantic representation I_rd, and the probability P_local is computed via

I_rd = tanh(W_7 [g_t; C_rd; C_th]),
P_local = softmax(W_8 I_rd + b_8),

where W_7 ∈ R^{d×2(d+d')}, W_8 ∈ R^{|V_text|×d}, and b_8 ∈ R^{|V_text|} are parameters, V_text is the vocabulary of job requirements (with V_skill a subset of V_text), and C_rd and C_th are context vectors generated by attention mechanisms. Specifically, C_rd is computed similarly to C_st, since both directly attend to the input sequence.
In addition, the skills S generated by the skill decoder are fed into the text decoder to guide the generation process. To obtain C_th, another attention model is leveraged:

γ_j = softmax(g_t^T W_10 q_j),  C_th = Σ_j γ_j q_j,

where q_j is the representation of the j-th predicted skill word and W_10 ∈ R^{d×d} are parameters.
The generation probability P(y_t | y_{<t}, X_i, S_i, B_i) is then computed as in Equation 3. As shown in Equations 8 and 10, the vector C_th appears explicitly only in P_local, which implies that P_local puts emphasis on the predicted skills, i.e., the local information, while the vector C_nd appears explicitly only in P_global, which indicates that P_global focuses on the skills given by the skill knowledge graph, i.e., the global prior knowledge.
In this way, SAMA considers not only the local information from the job description but also the global information from the skill knowledge graph.

Training and Inference
The loss function of the model has two parts: the negative log-likelihood of the silver skill labels, L_S, and that of the gold job requirement text, L_Y:

L = L_S + μ · L_Y,

where μ is a hyperparameter; we give more weight to the loss of the gold job requirement. During inference, the outputs of the skill decoder and the text decoder are predicted by

Ŝ_i = argmax_S P(S | X_i),  Ŷ_i = argmax_Y P(Y | X_i, Ŝ_i, B_i).

(The skill labels are silver standard because they were not created by experts but extracted by a trained model; the job requirement text is gold standard because it was written by humans and posted online.)
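As a numeric illustration, assuming the simple weighted combination L = L_S + μ·L_Y described above (the exact per-token averaging is an assumption), the loss on toy token probabilities looks like this:

```python
import math

# Sketch of the two-part training loss: mean negative log-likelihoods of
# the silver skill labels (L_S) and the gold text (L_Y), with mu > 1
# putting more weight on the gold job requirement.
def total_loss(skill_probs, text_probs, mu=1.4):
    L_S = -sum(math.log(p) for p in skill_probs) / len(skill_probs)
    L_Y = -sum(math.log(p) for p in text_probs) / len(text_probs)
    return L_S + mu * L_Y

# Toy per-token probabilities assigned to the reference tokens.
loss = total_loss([0.8, 0.9], [0.7, 0.6, 0.5], mu=1.4)
```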
For each stage, we obtain the output by greedy search at each step.
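The greedy decoding loop can be sketched as follows; the step function here is a stub that scripts a fixed skill sequence, standing in for a real argmax over the model's softmax output:

```python
# Sketch of greedy search: at each step take the (already-argmaxed) token
# from the step function; decoding stops at an end-of-sequence symbol.
def greedy_decode(step_fn, max_len, eos="<EOS>"):
    out, state = [], None
    for _ in range(max_len):
        token, state = step_fn(out, state)
        if token == eos:
            break
        out.append(token)
    return out

# Toy step function emitting a fixed <SEP>-delimited skill sequence.
script = iter(["english", "<SEP>", "c++", "<EOS>"])
tokens = greedy_decode(lambda out, st: (next(script), st), max_len=30)
```

The max_len of 30 matches the maximum skill-sequence length reported in the network configuration.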

Experiments
In this section, we conduct experiments to verify the effectiveness of SAMA.

Datasets
Job descriptions and job requirements are tokenized by the Pyltp word segmenter. Table 1 shows the split of the dataset. There are 468 position entities, 9 scale entities, 31,090 skill entities, and 310,413 relation edges in the skill knowledge graph. The vocabulary of job descriptions contains 14,189 words, the vocabulary of skills contains 3,523 words, and the vocabulary of job requirements contains 18,612 words.

Comparison Models
For a comprehensive comparative analysis of SAMA, we compare it with two kinds of representative models: standard generation models and hierarchical generation models.
• S2SA: Seq2Seq with attention (Luong et al., 2015), a standard generation model.
• DelNet: Deliberation networks (Xia et al., 2017), a hierarchical generation model with a two-pass decoder that generates and then polishes the same target sequence.
• VPN: Vocabulary pyramid networks (Liu et al., 2019), a hierarchical generation model with multi-pass encoders and decoders that generate a multi-level target sequence.
• SAMA(w/o pred): an ablated variant of SAMA that removes the skill prediction process.
• SAMA(w/o graph): another ablated variant of SAMA that removes the skill refinement process.

Network Configuration
In all models, we pretrain word2vec (Mikolov et al., 2013) on the job posting dataset. We set the word embedding dimension to 100 and the hidden vector size to 400 for both encoding and decoding. We set the maximum number of words in each skill sequence and each job requirement to 30 and 150, respectively. The weighting parameters λ and µ are set to 0.5 and 1.4, respectively, and the threshold θ is set to 100. We apply dropout (Zaremba et al., 2014) at a rate of 0.3. Models are trained for 15 epochs with the Adam optimizer (Kingma and Ba, 2015) and a batch size of 5.

Evaluation Metrics
To evaluate the performance of SAMA, we employ the following metrics: Word overlap based metrics: To evaluate the overall text generation quality, we employ BLEU-N (Papineni et al., 2002) and ROUGE-N (Lin, 2004) as evaluation metrics, in which BLEU-N is a kind of precision-based metric and ROUGE-N is a kind of recall-based metric.
Skill prediction metrics: Since the correctness of generated skills is of great importance in JPG, we further evaluate the quality of skills in generated job requirements, using Precision, Recall, and F1 value. To achieve this, we extract skills in the ground truth and generated text by a matching method based on the skill vocabulary V skill .
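The vocabulary-matching evaluation can be sketched as follows; tokenization and the exact matching scheme are simplified assumptions:

```python
# Sketch of skill-level evaluation: extract skills from the gold and the
# generated text by matching against V_skill, then compute P / R / F1.
def extract_skills(tokens, skill_vocab):
    return {t for t in tokens if t in skill_vocab}

def skill_prf(gold_tokens, gen_tokens, skill_vocab):
    gold = extract_skills(gold_tokens, skill_vocab)
    gen = extract_skills(gen_tokens, skill_vocab)
    tp = len(gold & gen)                       # correctly generated skills
    p = tp / len(gen) if gen else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

vocab = {"python", "english", "sql"}
p, r, f1 = skill_prf(["python", "english", "degree"],
                     ["python", "sql", "team"], vocab)
```

Non-skill words like "degree" and "team" are ignored, so the metrics isolate skill quality from overall text overlap.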
Human-based evaluation: Since it is difficult to measure the comprehensive quality of the generated texts, i.e., both the fluency of the text and the accuracy of the skills, we conduct a subjective evaluation in addition to the automatic metrics above. Three graduate student volunteers are asked to evaluate the generated paragraphs. We randomly sample 50 pieces of data from the test set. The job requirements generated by the different models are pooled and randomly shuffled for each volunteer. Each generated paragraph is rated as bad (irrelevant skills or disfluent sentences), normal (basically relevant skills and fluent sentences), or good (rich and relevant skills and fluent sentences).

Overall Performance

Table 2 shows the results of the word overlap based metrics. In terms of BLEU-N and ROUGE-N, SAMA performs best on all word overlap based metrics, which suggests that our model shares more overlapping words with the ground truth. SAMA(w/o graph) and SAMA(w/o pred) obtain competitive results, and both are significantly better than the baselines, which demonstrates the effectiveness of skill prediction and of the prior knowledge of skills, respectively. Beyond the overall metrics, Figure 4 reports the skill-level metrics: the job requirements generated by the skill-aware models (SAMA(w/o pred), SAMA(w/o graph), and SAMA) contain more accurate and richer skills than those generated by the baselines (S2SA, DelNet, and VPN). Among them, SAMA achieves the best performance. Besides, SAMA(w/o graph) obtains a higher recall rate, which demonstrates that it can enrich the skill information effectively, while SAMA(w/o pred) obtains a higher precision rate, which demonstrates that it can refine the skill information effectively.

Human-based Evaluation
Results of the human-based annotation are shown in Table 3. It can be seen that the skill-aware models obtain more relevant and informative ("good") results than the baselines, and SAMA obtains the most "good" results and the fewest "bad" results, consistent with the automatic metric results. S2SA obtains the most "normal" results: although fluent, its job requirements contain less rich and accurate skills. DelNet and VPN obtain a large percentage of "bad" results, mainly because of repeated sentences. Besides, SAMA(w/o pred) and SAMA(w/o graph) are both much worse than SAMA on "good" results, because SAMA(w/o pred) misses some skills and SAMA(w/o graph) misuses some skills. All models have kappa scores around 0.4, indicating moderate agreement among the evaluators.

Visualization Analysis
When the model generates the target sequence, different words contribute differently. SAMA can synthetically select the most informative words by utilizing the three attention mechanisms. Figure 5 shows a visualization of the three attention mechanisms (due to space limitations, only excerpts of the texts are shown). According to Figure 5, when SAMA generates the skill "EA (Environmental Art)", it automatically assigns larger weights to more informative words in all three sources, e.g., 'interior' in X; 'interior, design, construction, matching' in O; and 'interior, design, drawing, management' in S. This shows that SAMA can weigh the different contributions and automatically capture the most informative words from multiple sources.

Case Study
To illustrate the difference in quality between SAMA and the compared models, we give an example of the generated text in Figure 6, where we compare SAMA with the strong baseline S2SA. (Figure 6: Case study, translated from Chinese to English; skills in bold are correct and accurate, underlined skills are correct but inaccurate, and italic skills are incorrect.) As shown in Figure 6, SAMA captures all three aspects of the ground truth, while S2SA misses the third aspect. Moreover, in every aspect SAMA generates more correct and accurate skills, whereas S2SA performs noticeably worse and generates inaccurate skills. Generally, the main consideration of job seekers is the skills they need to master, such as Python, English, and Go. Therefore, although S2SA generates some correct words, like "preferred", this does not increase the quality of the generated text, because the skills themselves are inaccurate.

Parameter Analysis
We show how the two key hyperparameters of SAMA, λ and µ, influence the performance in Figure 7. The hyperparameter λ adjusts the balance of the probabilities between P local and P global and µ adjusts the balance between two losses, the loss in skill prediction L S and the loss in job requirements generation L Y .
The value of hyperparameter λ varies from 0.1 to 0.9 and bigger value implies more global prior knowledge of skills. Figure 7 shows that the performance gets a peak when the λ increases. It is intuitive that prior knowledge can help generate  accurate and rich skills. However, the too large value may sacrifice the fluency.
The value of hyperparameter µ varies from 1.1 to 2.0. We give greater weight to the loss of job requirements generation for the reason that it is the target of the JPG task. As observed in Figure 7, a weight close to 1 may introduce noises from the skill labels. Besides, when the weight continuously increases close to 2, the model is incapable of fully considering the skill labels.

Related Work
The related works fall into two categories, human resource management and generation models.

Human Resource Management
Human Resource Management (HRM) is an appealing topic for applied researchers, and recruitment is a key part of HRM. With the explosive growth of recruiting data, many studies focus on efficient automatic HRM, e.g., person-organization fit, intelligent job interviews, and job skill ranking. Lee and Brusilovsky (2007) designed a job recommender system that considers the preferences of both employers and candidates. Qin et al. (2019) proposed a personalized question recommender system to better interview candidates. Naim et al. (2015) analyzed interview videos to quantify verbal and nonverbal behaviors in the context of job interviews. Sun et al. (2019) studied the compatibility of person and organization. Other work proposed a data-driven approach for modeling the popularity of job skills. Besides, augmented writing tools such as Textio 7 and TapRecruit 8 have been developed to assist HR in writing job postings; they take a draft as input and then polish it.
In this paper, we also consider improving the efficiency of HRM from the perspective of job posting writing, which is the crucial first step in the recruitment process.
7 https://textio.com/products/
8 https://taprecruit.co/

Generation Models
Many practical applications are modeled as generation tasks, such as keyword extraction, headline generation, and response generation, and many of them are formulated as Seq2Seq learning problems. Plenty of studies have focused on optimizing the Seq2Seq model. For example, Lopyrev (2015) trained a Seq2Seq model with attention for headline generation. Xing et al. (2017) incorporated topic information into Seq2Seq via a joint attention mechanism to generate informative responses for chatbots. Meng et al. (2017) applied a Seq2Seq model with a copy mechanism to keyword extraction.
However, models without explicit sentence planning are severely limited in generating complex argument structures that depend on hierarchy. Dong and Lapata (2018) decomposed the semantic parsing process into sketch generation and detail filling and proposed a structure-aware neural architecture. Another study formulated the outline generation task as a hierarchical structured prediction problem and proposed HiStGen. Puduppully et al. (2019) proposed a two-stage model that incorporates content selection and planning for the data-to-text generation task.
Similar to the above studies, we propose a hierarchical generation model, SAMA, which first labels the job description with multiple skills and then generates the job requirement paragraph to tackle the JPG task. Different from prior art, SAMA considers the global information across the whole dataset to generate high-quality job requirements.

Conclusion
In this paper, we proposed the Job Posting Generation (JPG) task and formalized it as a conditional text generation problem, and we proposed a novel model, SAMA, for this task. The merits of SAMA come from three aspects. Firstly, it decomposes the long-text generation into two stages: an MLC task and a multiple-skill-guided text generation task. Secondly, it considers both local and global information to generate accurate and rich skills. Last but not least, the learned mapping relationships can be applied to various downstream tasks, such as resume generation and person-job fit. Extensive experiments conducted on real-world job posting data demonstrated the effectiveness and superiority of SAMA.