Research - sshucks/cvmatcher GitHub Wiki
General Approaches
CV Retrieval System based on job description matching using hybrid word embeddings
- Uses pre-trained embeddings (CBOW - Word2Vec)
- Trains own embeddings (domain-specific corpus)
- Aligns the dimensions with PCA to the pre-trained embeddings
- 3 methods for combining the two embedding spaces (sum, weighted sum, selection)

The self-trained embeddings were only trained on 5 different jobs -> if a CV matches one of these jobs, the self-trained embeddings are used, otherwise the pre-trained ones. The self-trained embeddings only improve the score for already known CVs.
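As a rough illustration (not code from the paper), the PCA alignment and the three combination strategies could look like the sketch below; all function and parameter names are made up for this example.

```python
import numpy as np
from sklearn.decomposition import PCA

def align_with_pca(own_vecs: np.ndarray, target_dim: int) -> np.ndarray:
    """Project the self-trained embeddings down to the dimensionality of the
    pre-trained embeddings so the two can be combined element-wise."""
    return PCA(n_components=target_dim).fit_transform(own_vecs)

def combine(pretrained: np.ndarray, own: np.ndarray, method: str = "weighted_sum",
            alpha: float = 0.5, known_job: bool = False) -> np.ndarray:
    """The three combination strategies named in the paper (sum, weighted sum,
    selection); parameter names are illustrative, not from the paper."""
    if method == "sum":
        return pretrained + own
    if method == "weighted_sum":
        return alpha * own + (1 - alpha) * pretrained
    if method == "selection":
        # use the self-trained vectors only for the jobs they were trained on
        return own if known_job else pretrained
    raise ValueError(f"unknown method: {method}")

# toy data: 100-dim pre-trained vectors, 300-dim self-trained vectors
pretrained = np.random.rand(1000, 100)
own = np.random.rand(1000, 300)
own_aligned = align_with_pca(own, target_dim=pretrained.shape[1])
combined = combine(pretrained, own_aligned, method="weighted_sum", alpha=0.3)
```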
Pre-trained model on ESCO:
- Multilingual
- Domain-specific and therefore more powerful than simple embeddings
https://huggingface.co/jjzha/esco-xlm-roberta-large
LLMs (Mixtral (8x7B))
https://www.ijcai.org/proceedings/2024/1011.pdf
Workflow with ESCOXLM-R and Mistral LLM
The LLM extracts skills and competencies from the CVs; these extracted skills are then labeled with ESCOXLM-R. The job offer is labeled directly with ESCOXLM-R. The matching is then carried out by the LLM again, which can also describe its assessment textually.
The model is multilingual -> it was mainly tested with English and Spanish.
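One way to approximate the ESCOXLM-R labeling step is to treat the model as a plain sentence encoder and assign each extracted skill phrase the closest ESCO label by cosine similarity. The paper fine-tunes the model for this task, so the following is only a hedged sketch; the ESCO labels and skill phrases are toy values.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("jjzha/esco-xlm-roberta-large")
model = AutoModel.from_pretrained("jjzha/esco-xlm-roberta-large")

def embed(texts):
    """Mean-pooled token embeddings, L2-normalised."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state        # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)          # (B, T, 1)
    pooled = (hidden * mask).sum(1) / mask.sum(1)         # (B, H)
    return torch.nn.functional.normalize(pooled, dim=-1)

esco_skills = ["Python (computer programming)", "manage a team"]   # toy subset
extracted = ["wrote Python scripts for data cleaning", "led a small project team"]

similarities = embed(extracted) @ embed(esco_skills).T    # cosine similarities
for phrase, idx in zip(extracted, similarities.argmax(dim=1)):
    print(phrase, "->", esco_skills[int(idx)])
```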
Language independence:
Language agnostic
There are multilingual approaches that do not need language-uniform data:
- Embeddings: using multilingual embeddings is a language-agnostic approach (for example with models like ESCO-XLM-Roberta-Large)
- LLMs: LLMs can be used for tasks like skill extraction. Multilingual LLMs exist (e.g. Mixtral 8x7B), but they can benefit from explicitly knowing the language (e.g. https://arxiv.org/abs/2402.03832 uses English prompts but includes the language of the data, i.e. of the CV or job description, in the prompt); see the sketch below
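A minimal sketch of that second point, i.e. injecting the detected document language into an otherwise English prompt. The prompt wording is invented for this example and call_llm is a hypothetical stub, not a real client.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stub: plug in the actual LLM client here
    (e.g. a hosted Mixtral 8x7B endpoint)."""
    raise NotImplementedError("connect this to the chosen LLM backend")

# English prompt template; only the detected language of the CV is injected,
# following the idea from the linked paper (wording is our own, not the paper's).
PROMPT_TEMPLATE = (
    "You are an HR assistant. The following CV is written in {language}.\n"
    "Extract all skills and competences as a JSON list of short English phrases.\n\n"
    "CV:\n{cv_text}"
)

def extract_skills(cv_text: str, language: str) -> str:
    prompt = PROMPT_TEMPLATE.format(language=language, cv_text=cv_text)
    return call_llm(prompt)
```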
Language detection
https://medium.com/@monigrancharov/text-language-detection-with-python-beb49d9667b3 In Python there are multiple packages that can be used to detect the language of a text.
Pop-weighted means the recall for each language is multiplied by its number of speakers. The average recall (probability of correctly identifying the language) of fastText is the best in all pop-weighted categories, but it is slower than CLD2. CLD2 is a C++ library, but there is a Python bindings library (pycld2) that makes it usable in Python.
Own experiments to try out pycld2:
Easy to install (pip install pycld2), worked immediately, tested on 12 German CVs -> all correctly classified, takes basically no time (on my machine 0.26 ms per CV).
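For reference, the detection call used in this quick test looks roughly like this (how to handle unreliable results is still open):

```python
import pycld2 as cld2

def detect_language(text: str) -> str:
    """Return the ISO code of the most likely language, or 'unknown'."""
    is_reliable, _, details = cld2.detect(text)
    # details is a tuple of guesses, e.g. (('GERMAN', 'de', 99, 1152.0), ...)
    _, lang_code, _, _ = details[0]
    return lang_code if is_reliable else "unknown"

print(detect_language("Berufserfahrung als Softwareentwickler in Wien"))  # -> de
```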
Machine Translation
Another approach would be to translate into a target language for processing (in our case either English or German); this would give the benefit of more models to choose from. The embeddings below work either on English or German but are not multilingual, so translation would be the only way to use them for multiple languages.
DeepL offers a translation API, but it is costly: with the free tier we could only translate 500,000 characters, which covers roughly 100 CVs -> not enough.
There are Python libraries for translating text (pip install deep-translator) that use various online translators, for example Google Translate -> can those be used for our data? (data security)
deep-translator: automatic language detection, can translate batches, supports multiple different online translators as well as ChatGPT. ToDo: test how well it works.
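A small sketch of how deep-translator could be used (here with the Google Translate backend; whether sending CV data to such a service is acceptable is exactly the open data-security question above):

```python
from deep_translator import GoogleTranslator

# Translate a batch of CV snippets to German; other backends (DeepL, Microsoft,
# ChatGPT, ...) are available through the same interface.
translator = GoogleTranslator(source="auto", target="de")

snippets = [
    "Experienced data engineer with a focus on ETL pipelines.",
    "Responsable de l'équipe de développement backend.",
]
print(translator.translate_batch(snippets))
```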
Embeddings
SkillBERT
Source: SKILLBERT: "SKILLING" THE BERT TO CLASSIFY SKILLS!
Currently used model:
- Model: bert-base-german-cased
- Type: General-purpose language model
- Limitation: Not specialized in professional or skill-related data → lower accuracy

SkillBERT – specialized BERT model for job data:
- Trained on: 700,000 job postings & 2,997 individual skills
- Goal: Mapping skills to 40 competence groups
Advantages:
- Higher accuracy in classifying professional skills
- Differentiates between “core” and “fringe” skills
- Supported by additional features such as similarity scores and clustering
JobBERT
JobBERT: Understanding Job Titles through Skills

JobBERT is a neural representation model designed to understand and model job titles. Its main goal is the normalization of job titles, i.e. converting various, often non-standardized job title formats into a standardized and comprehensible form. This is highly valuable for various HR processes such as online recruitment, internal organization, and the extraction of meaningful information.
How does it work? JobBERT is based on the assumption that skills are the essential components that define a job. The model creates vector representations of job titles by enhancing a pre-trained language model (such as BERT) with co-occurrence information from skill labels extracted from job postings. In other words, JobBERT learns to understand the meaning of a job title through the skills typically associated with that position.
A key feature of JobBERT is the use of a distant supervision approach. This means the model doesn’t require a large dataset of manually labeled job titles for training. Instead, existing job postings that contain job titles and skill descriptions are used to automatically generate training data.
The JobBERT model consists of a BERT-based encoder that generates vector representations of job titles. The mapping of a job title to standardized titles is based on proximity within the vector space of these representations. The model also includes a gating mechanism that allows it to ignore irrelevant words in the job title (such as locations) and assign more weight to the important ones.
Results and Achievements: The paper shows that JobBERT leads to significant improvements in the task of job title normalization compared to using generic sentence encoders. The authors have also introduced a new evaluation benchmark for job title normalization, enabling comparison and benchmarking of different methods.
In summary, JobBERT provides an effective and efficient solution to the problem of job title normalization. By understanding the skills associated with professions and using distant supervision, it overcomes the limitations of previous methods.
CareerBERT
CareerBERT: Matching resumes to ESCO jobs
CareerBERT is an innovative approach to job recommendation systems. It helps generate more accurate and comprehensive job recommendations based on unstructured text data, such as résumés (CVs).
Core Approach: The model focuses on matching résumés and ESCO jobs (ESCO stands for European Skills, Competences, Qualifications and Occupations) in a shared embedding space. This is done by transforming both résumés and jobs (or ESCO job classifications) into vectors within this common space.
CareerBERT is based on the Sentence-BERT (SBERT) architecture, which itself is built upon BERT. SBERT is well-suited for generating semantically meaningful vector representations (embeddings) of sentences and longer texts.
Data Sources: For training, the article uses a combination of data from the ESCO taxonomy (which provides a standardized framework for describing occupations) and job postings from EURopean Employment Services (EURES) (which provide real-world examples of job descriptions).
Training Method: The model is trained to create representations of résumés and jobs in a shared embedding space, such that similar items (matching résumés and jobs) lie close together in the vector space. The article describes a method for generating the training dataset that involves combining job titles with related information from ESCO data (such as skills, synonyms, and descriptions). It also involves computing "job posting centroids" and "job centroids" from EURES data.
Results and Comparisons: In the experiments, CareerBERT—especially when using jobGBERT as a base model—outperformed traditional and some state-of-the-art embedding methods in matching EURES job postings to ESCO job descriptions. A human evaluation using real résumés and HR experts confirmed the effectiveness of CareerBERT in generating relevant job recommendations.
BERT Model
Enhancing Job Matching Through Natural Language Processing: A BERT-based Approach. Based on this article, the following useful information about BERT is included:
- Improved Contextual Language Understanding: BERT enhances machines’ ability to understand language in context.
- Training on Large Text Corpora: BERT has been trained on massive amounts of text, enabling it to capture complex details of language.
- Versatile Applications in NLP: This property makes BERT highly versatile for various natural language processing (NLP) tasks.
- Adaptability and Superior Performance: BERT’s main strength lies in its adaptability to specific needs, which allows for superior performance in targeted applications. This adaptability sets it apart from many other algorithms.
- Precise Matching in Job Recommendation Systems: In job recommendation systems, BERT provides precise matches by encoding job descriptions and user résumés into semantically meaningful representations using deep learning models. This improves the relevance and quality of job recommendations.
- Superior Performance in Evaluation: Compared to cosine similarity and Jaccard similarity techniques (as mentioned in this article), BERT has demonstrated overall superior performance. Its high recall rate and F1 score highlight its strong ability to accurately identify the most similar entities. BERT's cutting-edge contextual understanding and representation make it the most effective method for evaluating text similarity.
- Use in Personalized Job Recommendations: The system presented in this article uses BERT to provide personalized job recommendations based on résumé data and trends in job descriptions.
Comparison of Models: BERT vs. JobBERT, SkillBERT, CareerBERT
This site—Hugging Face’s Models Hub—provides a comprehensive catalog of pre-trained language models. You can browse every available model and view detailed metadata, including the languages each model supports. Based on the supported-language information listed there, I have created the comparison table above.
Scoring Methods
Cross-Encoder Data Annotation for Bi-Encoder Based Product Matching
Comparison of Cross-Encoder and Bi-Encoder for CV–Job Matching
In the current system, candidate CVs and job requirement profiles are compared using embeddings and cosine similarity — following a Bi-Encoder architecture. This method is fast and scalable but may lack precision.
Based on the study "Cross-Encoder Data Annotation for Bi-Encoder Based Product Matching" (EMNLP 2022), a hybrid strategy was proposed to combine the strengths of both models:
A Cross-Encoder, which jointly encodes both inputs and captures deeper contextual information, is used to annotate or refine training data.
This annotated data is then used to train a new Bi-Encoder, resulting in significant performance gains.
Key insights:
- Without human-labeled data, Cross-Encoder annotation improves Bi-Encoder accuracy by +4%.
- When human labels exist, combining them with Cross-Encoder predictions (intersection) improves accuracy by +2%.
- The Cross-Encoder acts as a “teacher” model, enhancing training quality without sacrificing speed at inference time.
Recommendation: Use a Cross-Encoder to annotate new CV–job pairs and train an updated Bi-Encoder with this refined dataset to enhance both accuracy and scalability.
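A possible shape of this teacher/student setup with the sentence-transformers library is sketched below; the model names are placeholders and the pairs and scores are toy values, not data or results from the study.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, CrossEncoder, InputExample, losses, util

cross_encoder = CrossEncoder("cross-encoder/stsb-roberta-base")  # teacher (placeholder model)
bi_encoder = SentenceTransformer("bert-base-german-cased")       # student (current base model)

# Unlabeled CV-job pairs (toy examples)
cv_job_pairs = [
    ("CV: Python developer, 5 years of ETL experience", "Job: data engineer, ETL pipelines"),
    ("CV: nurse with ICU experience", "Job: backend software developer"),
]

# 1) The teacher annotates the pairs with similarity scores.
scores = cross_encoder.predict(cv_job_pairs)

# 2) The annotated pairs become training data for the bi-encoder.
train_examples = [InputExample(texts=[cv, job], label=float(s))
                  for (cv, job), s in zip(cv_job_pairs, scores)]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=2)
bi_encoder.fit(train_objectives=[(train_loader, losses.CosineSimilarityLoss(bi_encoder))],
               epochs=1)

# 3) At inference time only the fast bi-encoder is needed.
embeddings = bi_encoder.encode(["CV text ...", "Job requirement profile ..."])
print(util.cos_sim(embeddings[0], embeddings[1]))
```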
Models
TBC
Ontologies
ESCO
(https://esco.ec.europa.eu/de)
"European Skills, Competences, Qualifications and Occupations" ist eine Ontologie für den europäischen Arbeitsmarkt. Sie ist aus 3 Säulen aufgebaut:
- Occupations
- Skills and competences
- Qualifications
The individual pillars are described using hierarchical taxonomies. Example: Occupations:
- 2 - Academic occupations
  - 21 - Natural scientists, mathematicians and engineers
    - 213 - Life scientists
      - 2133 - Environmental scientists
Skills and competences required to practice the profession are linked to each profession. There are also alternative labels for each profession that make it easier to assign them.
Each occupation/skill is identified by a URI, making it language-independent. There is at least one label for each entity in each of the 27 European languages.
Experiment
We tested how many of the employment positions extracted from the CVs are recognized by exact matching against the altLabels and hiddenLabels provided by ESCO. Only one eighth of all positions searched for were recognized (7/56). The synonyms from ESCO are therefore not sufficient on their own.
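A sketch of this exact-match check, assuming the ESCO CSV download is used; the file name and the column names (preferredLabel, altLabels, hiddenLabels) are assumptions and may differ between ESCO versions and language packs:

```python
import pandas as pd

# Assumed ESCO occupations export; alternative labels are assumed to be
# newline-separated inside one CSV cell.
occupations = pd.read_csv("occupations_de.csv")

def build_label_set(df: pd.DataFrame) -> set:
    """Collect preferred, alternative and hidden labels, lower-cased."""
    labels = set()
    for _, row in df.iterrows():
        labels.add(str(row["preferredLabel"]).strip().lower())
        for col in ("altLabels", "hiddenLabels"):
            if col in df.columns and pd.notna(row[col]):
                labels.update(lbl.strip().lower() for lbl in str(row[col]).split("\n"))
    return labels

esco_labels = build_label_set(occupations)

# Positions extracted from the CVs (toy values, not the real experiment data)
extracted_positions = ["Softwareentwickler", "Projektleiter", "Pflegefachkraft"]
hits = [p for p in extracted_positions if p.lower() in esco_labels]
print(f"{len(hits)}/{len(extracted_positions)} positions matched exactly")
```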
O*NET
O*NET is a US ontology for the labor market. It is divided into 6 categories:
- Worker Characteristics (Enduring characteristics that may influence both performance and the capacity to acquire knowledge and skills required for effective work performance)
- Worker Requirements (Descriptors referring to work-related attributes acquired and/or developed through experience and education)
- Experience Requirements (Requirements related to previous work activities and explicitly linked to certain types of work activities)
- Occupational Requirements (A comprehensive set of variables or detailed elements that describe what various occupations require)
- Workforce Characteristics (Variables that define and describe the general characteristics of occupations that may influence occupational requirements)
- Occupation-Specific Information (Variables or other Content Model elements of selected or specific occupations)
ESCO vs O*NET
ESCO is mainly limited to competences and skills, while O*NET also includes other factors such as interests or working style. O*NET is therefore more diverse, but this is not necessary in our context.
ESCO is developed for Europe, O*NET for the US, the priorities and requirements may differ.
ESCO supports 27 languages, O*NET is only available in English
A mapping from ESCO to O*NET was developed, where at least one O*NET profession was mapped to each ESCO profession (https://esco.ec.europa.eu/en/about-esco/data-science-and-esco/crosswalk-between-esco-and-onet). This would also make it possible to use both ontologies. For example, first ESCO for language inclusivity and then mapping to O*NET for extended properties and requirements.