Fine tuning use cases - uw-ssec/llmaven GitHub Wiki
Use case: Improve retrieval performance in RAG scenarios by training a custom embedding model on a scientific domain (e.g. astronomy)
Dataset(s): arXiv papers
References: https://arxiv.org/abs/2309.06126
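
To make "improve retrieval performance" measurable, one option is to compare the base and the fine-tuned embedding model on recall@k and mean reciprocal rank (MRR) over a held-out set of queries with known relevant papers. The sketch below is only an illustration: it assumes each model's ranked results are already available as lists of document IDs, and the function names and data layout (`runs`, `qrels`) are hypothetical, not part of any existing llmaven code.

```python
from typing import Dict, List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(ranked[:k]) & relevant)
    return hits / len(relevant)


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for position, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0


def evaluate(
    runs: Dict[str, List[str]],   # query ID -> ranked document IDs (assumed layout)
    qrels: Dict[str, Set[str]],   # query ID -> relevant document IDs (assumed layout)
    k: int = 5,
) -> Dict[str, float]:
    """Average recall@k and MRR over all queries that have relevance labels."""
    if not qrels:
        return {"recall_at_k": 0.0, "mrr": 0.0}
    recalls = []
    rranks = []
    for query_id, relevant in qrels.items():
        ranked = runs.get(query_id, [])
        recalls.append(recall_at_k(ranked, relevant, k))
        rranks.append(reciprocal_rank(ranked, relevant))
    return {
        "recall_at_k": sum(recalls) / len(recalls),
        "mrr": sum(rranks) / len(rranks),
    }
```

Running `evaluate` once per model on the same `qrels` gives a side-by-side comparison; any improvement from fine-tuning would show up as higher values on both metrics.
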
5/28 notes, next steps
- Define use cases for common open source tasks (e.g. improve repo structure, add PyPI publishing, add linting, etc.)
- Select a base model and evaluate how well it can do on the use cases
- Define evaluation metrics and create evaluation set
- Based on the gaps found (if any!), plan what training data is needed
- Training set ideas:
  - Issue -> PR pairs as task completion examples (with code as context)
  - Quality filters: many OSS repos are not high quality, so filter by stars, badges, presence of unit tests, etc.
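
The quality filters above could be sketched as a simple predicate over repository metadata. This is a minimal illustration under stated assumptions: the `RepoInfo` fields, the function names, and the thresholds (`min_stars=100`, `min_badges=1`) are hypothetical placeholders, and gathering the metadata itself (e.g. from the GitHub API) is left out.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class RepoInfo:
    """Hypothetical metadata record for one candidate repository."""
    name: str
    stars: int
    badge_count: int       # number of README badges (CI, coverage, etc.)
    has_unit_tests: bool   # e.g. a tests/ directory was detected


def passes_quality_filter(
    repo: RepoInfo,
    min_stars: int = 100,   # assumed threshold, to be tuned
    min_badges: int = 1,    # assumed threshold, to be tuned
) -> bool:
    """Keep a repo only if it clears every quality signal."""
    return (
        repo.stars >= min_stars
        and repo.badge_count >= min_badges
        and repo.has_unit_tests
    )


def filter_repos(
    repos: Iterable[RepoInfo],
    min_stars: int = 100,
    min_badges: int = 1,
) -> List[RepoInfo]:
    """Return the repos whose issue -> PR pairs are worth mining."""
    return [r for r in repos if passes_quality_filter(r, min_stars, min_badges)]
```

Requiring all signals at once is the strictest choice; a weighted score would be an alternative if this filter leaves too little training data.
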