Research - sshucks/cvmatcher GitHub Wiki
General Approaches
CV Retrieval System based on job description matching using hybrid word embeddings
- Uses pre-trained embeddings (CBOW - Word2Vec)
- Trains own embeddings (domain-specific corpus)
- Aligns the dimensions with PCA to the pre-trained embeddings
- 3 methods for combining (sum, weighted sum, selection). The self-trained embeddings were only trained on 5 different jobs -> if a CV matches one of these jobs, the self-trained embeddings are used, otherwise the pre-trained ones.
- Self-trained embeddings only improve the score for already known CVs.
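The three combination strategies can be sketched as follows. This is a minimal illustration, assuming both embeddings have already been aligned to the same dimensionality (e.g. via PCA); the vectors and the weight `alpha` are made up.

```python
import numpy as np

def combine_sum(pre, own):
    # strategy 1: element-wise sum of both vectors
    return pre + own

def combine_weighted(pre, own, alpha=0.7):
    # strategy 2: weighted sum; alpha is a hypothetical weight
    # favouring the domain-specific (self-trained) vector
    return (1 - alpha) * pre + alpha * own

def combine_select(pre, own, job_is_known):
    # strategy 3: selection -- use the self-trained vector only
    # for one of the 5 jobs it was trained on
    return own if job_is_known else pre

pre = np.array([0.1, 0.3, 0.5])  # pre-trained embedding (illustrative)
own = np.array([0.2, 0.1, 0.4])  # self-trained embedding (illustrative)

print(combine_sum(pre, own))             # element-wise sum
print(combine_weighted(pre, own))        # weighted sum
print(combine_select(pre, own, False))   # falls back to pre-trained
```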
Pre-trained model on ESCO:
- Multilingual
- Domain-specific and therefore more powerful than simple embeddings
https://huggingface.co/jjzha/esco-xlm-roberta-large
LLMs (Mixtral (8x7B))
https://www.ijcai.org/proceedings/2024/1011.pdf
Workflow with ESCOXLM-R and Mistral LLM
The LLM extracts skills and competencies from the CVs; these extracted skills are then labeled with ESCOXLM-R. The job offer is labeled directly with ESCOXLM-R. The matching itself is again carried out by the LLM, which can also describe its assessment textually.
Multilingual model -> was mainly tested with English and Spanish.
Language independence:
Language agnostic
There are multilingual approaches that do not require language-uniform data:
- Embeddings: using multilingual embeddings is a language-agnostic approach (for example with models like ESCO-XLM-Roberta-Large)
- LLMs: LLMs can be used for tasks like skill extraction. Multilingual LLMs exist (e.g. Mixtral 8x7B), but they can benefit from explicitly knowing the language (e.g. https://arxiv.org/abs/2402.03832 uses English prompts but includes the language of the data, such as the CV or job description, in the prompt)
Language detection
https://medium.com/@monigrancharov/text-language-detection-with-python-beb49d9667b3 In Python there are multiple packages that can be used to detect the language of a text.
Pop-weighted means the recall for each language is multiplied by its number of speakers. The average recall (probability of correctly identifying the language) of fastText is best in all pop-weighted categories, but it is slower than CLD2. CLD2 is a C++ library, but there is a Python bindings library (pycld2) which makes it usable from Python.
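As a small illustration of the pop-weighted metric described above (the recall values and speaker counts here are made up, not taken from the benchmark):

```python
# Hypothetical per-language recall and speaker counts (illustrative only)
recalls  = {"en": 0.99, "de": 0.97, "zh": 0.95}
speakers = {"en": 1500, "de": 130, "zh": 1100}  # millions of speakers

# pop-weighted recall: each language's recall weighted by its speaker count
total = sum(speakers.values())
pop_weighted = sum(recalls[l] * speakers[l] for l in recalls) / total
print(round(pop_weighted, 4))
```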
Own experiments to try out pycld2:
Easy to install (pip install pycld2), worked immediately; tested on 12 German CVs -> all correctly classified; takes basically no time (on my machine, 0.26 ms per CV).
Machine Translation
Another approach would be to translate to a target language for processing (in our case either English or German). This would give the benefit of more models to choose from. The embeddings below work on either English or German but are not multilingual, so translation would be the only way to use them for multiple languages.
Risks of Machine Translation:
Machine translation isn't always accurate. In our case the biggest risk is a too-literal translation of normalized terms. For example: "master of science" -> "Meister der Naturwissenschaften", or "mittlere Reife" -> "medium maturity".
DeepL has an API for translation -> but it is costly; with the free version we could only translate 500,000 characters, which covers around 100 CVs -> not enough.
There are Python libraries for translating text (pip install deep-translator) that use various online translators, for example Google Translate -> can those be used for our data? (data security)
deep-translator: automatic language detection, can translate batches, support for multiple online translators as well as ChatGPT. ToDo: test how well it works.
Explainability
The score doesn't need to be 100% understandable, but it should be somewhat understandable.
- LLMs (Mixtral (8x7B)) can textually describe why a candidate is a good fit -> this seems understandable for the user but is actually a black-box model
- Calculating the score as a weighted sum of different sub-scores (as currently done) -> understandable insofar as the user can see whether, for example, skills or professional experience align more with the requirements; however, the individual sub-scores are still not fully understandable for a user (embeddings, distances, ...)
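The weighted-sum scoring mentioned above can be sketched in a few lines. The sub-score names and weights here are made up for illustration:

```python
def total_score(sub_scores, weights):
    # overall score = sum of sub-scores, each scaled by its weight
    assert set(sub_scores) == set(weights)
    return sum(sub_scores[k] * weights[k] for k in sub_scores)

sub_scores = {"skills": 0.8, "experience": 0.6, "education": 0.9}
weights    = {"skills": 0.5, "experience": 0.3, "education": 0.2}  # sum to 1

print(total_score(sub_scores, weights))  # 0.8*0.5 + 0.6*0.3 + 0.9*0.2 = 0.76
```

The weights make the relative importance of each sub-score visible to the user, even if the sub-scores themselves come from opaque embedding distances.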
Embeddings
Comparative Analysis: BERT vs. JobBERT, SkillBERT, CareerBERT
1. Google BERT (bert-base-german-cased)
- Description: A general-purpose BERT model pre-trained on a large corpus of German text. It is not specialized for any specific domain, including job data.
- Use Cases: Suitable for a variety of natural language processing tasks but may lack precision in specialized contexts like job matching.
2. SkillBERT
- Description: A specialized BERT model designed specifically for classifying professional skills based on job postings.
- Training Data: Trained on 700,000 job postings and 2,997 individual skills.
- Advantages: Higher accuracy in classifying professional skills. Differentiates between "core" and "fringe" skills.
- Source: SKILLBERT: "SKILLING" THE BERT TO CLASSIFY SKILLS, https://openreview.net/pdf?id=TaUJl6Kt3rW
3. JobBERT
- Description: A neural model focused on understanding and normalizing job titles to standardized formats.
- Working Method: Enhances a pre-trained BERT model with skill-label co-occurrence information extracted from job postings.
- Advantages: Utilizes distant supervision, eliminating the need for large datasets of manually labeled job titles. Significant improvements in job title normalization tasks.
- Source: JobBERT: Understanding Job Titles through Skills, https://arxiv.org/pdf/2109.09605
4. CareerBERT
- Description: Tailored for job recommendation systems, focusing on matching résumés to ESCO jobs.
- Training Data: Combines the ESCO taxonomy and job postings from the European Employment Services (EURES).
- Advantages: Creates a shared embedding space for matching résumés and jobs. Outperforms traditional embedding methods in job matching tasks.
- Source: CareerBERT: Matching resumes to European Skills, https://www.sciencedirect.com/science/article/pii/S0957417425006657
Scoring Methods
Models
I examined the models from two perspectives: Similarity Metrics and Deep Learning.
Comparison of Models and Similarity Metrics for CV-Job Matching
1. Cosine + BERT
Description:
Uses the BERT model to convert text into vectors (embeddings), and then calculates the similarity between CV and Job Description vectors using Cosine Similarity.
Advantages:
- Results are explainable.
- Fast and lightweight to execute.
Limitations:
- Less accurate compared to newer models like SBERT.
Reference: Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks, https://arxiv.org/abs/1908.10084
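The cosine step of this approach is straightforward to sketch. The vectors below are made up; real BERT embeddings have 768 dimensions:

```python
import numpy as np

# Cosine similarity between two embedding vectors, e.g. a CV embedding
# and a job-description embedding produced by BERT.
def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

cv_vec  = np.array([0.2, 0.5, 0.1])
job_vec = np.array([0.4, 1.0, 0.2])  # same direction as cv_vec

print(cosine_similarity(cv_vec, job_vec))  # 1.0: magnitude is ignored
```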
2. SBERT (Sentence-BERT)
Description: An optimized version of BERT for sentence comparison. Unlike regular BERT, its output vectors are directly comparable, without needing separate fine-tuning.
Advantages:
- High accuracy.
- Explainable results.
- Supports pre-trained models.
Reference: SBERT Paper, https://aclanthology.org/D19-1410/
3. Siamese Network
Description:
A neural network architecture that processes two inputs in parallel to learn their similarity. It can be trained to match CVs and job postings.
Advantages:
- High performance, especially after training.
Limitations:
- Requires a large amount of data for training.
- Time-consuming to implement and test.
Reference: Siamese Network, https://en.wikipedia.org/wiki/Siamese_neural_network
4. ESCO-XLM-R
Description: A multilingual RoBERTa model trained on EU standard occupational data (ESCO). It is used for skill annotation in CVs and job postings.
Advantages:
- High accuracy, especially for occupational data.
- Suitable for multiple languages.
Limitations:
- Requires a GPU for execution.
Reference: ESCO-XLM-R on HuggingFace, https://huggingface.co/jjzha/esco-xlm-roberta-large
5. Mixtral (8x7B) LLM
Description: An advanced large language model (LLM) capable of extracting skills and abilities from CVs and analyzing semantic similarity with job descriptions.
Advantages:
- Explainable output in text form ("why this CV is suitable").
Limitations:
- Requires very high computational resources.
- Slower processing speed.
Reference: Mixtral Paper, https://arxiv.org/abs/2401.04088
Conclusion
Several models were compared for matching CVs with job descriptions. While Cosine + BERT is simple and explainable, SBERT and ESCO-XLM-R offer better performance while remaining interpretable. For best results, a hybrid of SBERT and ESCO-XLM-R is recommended.
Table 1: Comparative Analysis of Deep Learning Models for Semantic Matching Tasks
Comparison of Similarity Metrics: Cosine Similarity, Euclidean Distance, and Manhattan Distance:
Data clustering: 50 years beyond K-means
Table 2
Data Mining: Concepts and Techniques
Cosine Similarity is ideal for measuring angular similarity, particularly in text mining tasks, because it normalizes for vector magnitude. Euclidean Distance is best for straightforward numeric comparisons but may distort results if data are not properly scaled. Manhattan Distance offers robustness against outliers and is well-suited for high-dimensional or grid-based datasets. Cosine Similarity is well-suited for our project because it effectively measures the semantic similarity between vectorized representations of resumes and job requirements generated by the BERT model, focusing on meaning rather than magnitude.
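The three metrics can be compared side by side on the same pair of vectors. The vectors here are illustrative, not real embeddings; note how cosine similarity normalizes the magnitude away while the two distances do not:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as a, twice the magnitude

cosine    = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
euclidean = float(np.linalg.norm(a - b))       # sqrt(1 + 4 + 9)
manhattan = float(np.sum(np.abs(a - b)))       # 1 + 2 + 3

print(cosine)     # 1.0 -- identical direction despite different lengths
print(euclidean)  # ~3.742
print(manhattan)  # 6.0
```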
Ontologies
ESCO
(https://esco.ec.europa.eu/de)
"European Skills, Competences, Qualifications and Occupations" is an ontology for the European labour market. It is built from 3 pillars:
- Occupations
- Skills and competences
- Qualifications
The individual pillars are described using hierarchical taxonomies. Example (occupations):
- 2 - Academic occupations
  - 21 - Natural scientists, mathematicians and engineers
    - 213 - Life scientists
      - 2133 - Environmental scientists
Skills and competences required to practice the profession are linked to each profession. There are also alternative labels for each profession that make it easier to assign them.
Each occupation/skill is identified by a URI, making it language-independent. There is at least one label for each entity in each of the 27 European languages.
Experiment
We tested how many employment positions from the extracted CVs are recognized when matching via exact match against the hidden labels and alternative labels provided by ESCO. Only one eighth of all searched positions were recognized (7/56). The synonyms from ESCO are therefore not sufficient here.
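The experiment above can be sketched as a simple lookup. The label set, identifiers, and extracted titles below are made up for illustration; the real experiment used the preferred, alternative, and hidden labels from the ESCO dataset:

```python
# Hypothetical ESCO label index: every known label (preferred, alt,
# hidden) maps to an occupation identifier.
esco_labels = {
    "software developer":       "occ:software-developer",  # preferred
    "software engineer":        "occ:software-developer",  # alt label
    "environmental scientist":  "occ:env-scientist",
}

def exact_match(title):
    # lowercase exact matching, as in the experiment
    return esco_labels.get(title.strip().lower())

extracted = ["Software Engineer", "Full-Stack Wizard"]
matched = [t for t in extracted if exact_match(t)]
print(len(matched), "of", len(extracted), "recognized")
```

Titles that do not literally appear among ESCO's labels (here "Full-Stack Wizard") are missed, which explains the low 7/56 hit rate.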
O*NET
O*NET is a US ontology for the labor market. It is divided into 6 categories:
- Worker Characteristics (Enduring characteristics that may influence both performance and the capacity to acquire knowledge and skills required for effective work performance)
- Worker Requirements (Descriptors referring to work-related attributes acquired and/or developed through experience and education)
- Experience Requirements (Requirements related to previous work activities and explicitly linked to certain types of work activities)
- Occupational Requirements (A comprehensive set of variables or detailed elements that describe what various occupations require)
- Workforce Characteristics (Variables that define and describe the general characteristics of occupations that may influence occupational requirements)
- Occupation-Specific Information (Variables or other Content Model elements of selected or specific occupations)
ESCO vs O*NET
ESCO is mainly limited to competences and skills, while O*NET also includes other factors such as interests or working style. O*NET is therefore more diverse, but this is not necessary in our context.
ESCO is developed for Europe, O*NET for the US, the priorities and requirements may differ.
ESCO supports 27 languages, O*NET is only available in English
A mapping (crosswalk) from ESCO to O*NET was developed in which at least one O*NET occupation is mapped to each ESCO occupation (https://esco.ec.europa.eu/en/about-esco/data-science-and-esco/crosswalk-between-esco-and-onet). This would also make it possible to use both ontologies: for example, first ESCO for language inclusivity, then the mapping to O*NET for extended properties and requirements.
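Chaining the two ontologies via the crosswalk could look roughly like this. The identifiers and the attribute lookup are made up for illustration; the real crosswalk maps ESCO occupation URIs to O*NET-SOC codes:

```python
# Hypothetical crosswalk: one ESCO occupation -> one or more O*NET codes
crosswalk = {"esco:occ/2133": ["onet:19-2041.00"]}

# Hypothetical O*NET record with properties ESCO does not cover
onet_info = {"onet:19-2041.00": {"interests": ["Investigative"]}}

def extended_properties(esco_occupation):
    # resolve the ESCO occupation first, then enrich it with O*NET data
    return [onet_info[code] for code in crosswalk.get(esco_occupation, [])]

print(extended_properties("esco:occ/2133"))
```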
ELM: Europass Learning Model Ontology
https://europa.eu/europass/elm-browser/documentation/rdf/ontology/documentation/elm.html#/
The ELM is an ontology for the educational sector. It can be linked to ESCO.
Parsing the CVs
https://aclanthology.org/C18-1326/ A Survey on Open Information Extraction
I analyze this task from two perspectives: the research literature and testing on CVs:
Test-analysis Table:
TBD
1. Degrees (Abschlüsse) Are Omitted
One key issue with the Fecher API is its inability to consistently extract academic degrees (e.g., "Bachelor of Science"). The paper discusses the limitations of Named Entity Recognition (NER) in handling diverse and complex entity types in unstructured web data: "Named Entity Recognition (NER) is unsuitable to target the variety and complexity of entity types on the Web." → Page 1, Lines 59–63 This suggests that standard NER tools may fail to recognize that terms such as "B.Sc." or "Studium der Informatik" refer to academic qualifications.
2. Skills Are Incorrectly Assigned
The paper emphasizes that many OpenIE systems rely on shallow linguistic methods, such as part-of-speech tagging, instead of deeper semantic analysis: “…such tools should be avoided in favor of shallow parsing methods such as part-of-speech (POS) taggers.” → Page 1, Lines 63–67 This indicates that systems like Fecher may misclassify skills such as “Python” or “Teamwork” due to a lack of contextual and semantic understanding.
3. Multi-Column CVs Cause Extraction Errors
While the paper does not directly mention multi-column CVs, it does highlight that most OpenIE approaches are designed for flat, simple text and perform poorly on structurally complex documents: “…recent approaches frequently rely on the output of a dependency parser … thereby hurting the domain-independence and efficiency assumptions.” → Page 2, Lines 8–11 This implies that layout-sensitive formats, such as multi-column résumés, likely fall outside the processing capabilities of standard OpenIE-based APIs like Fecher.
Solution:
https://openbiomedicalengineeringjournal.com/VOLUME/18/ELOCATOR/e18741207289680 Exploring Biomedical Named Entity Recognition via SciSpaCy and BioBERT Models
TBD