Home - Gnurro/FinetuneReFormatter GitHub Wiki
Welcome to the FinetuneReFormatter wiki!
Features
FinetuneReFormatter offers multiple modes to prepare training data for GPT (or other LM) finetuning/training:
- SourceInspector mode, which comes with a text editor and finds and tracks multiple common issues in raw scraped text data
- InitialPrep mode, which can be used to calculate various text statistics, like word count and token distribution, to convert text to ChunkFile, a rolling context data format saved as JSON, and to apply a few quick data tweaks
- ChunkStack mode, which can be used to view and edit ChunkFiles and helps with building rolling context text data
- ChunkCombiner mode, which can be used to combine ChunkFiles into proper training data text and allows additional batch formatting determined by chunk types
- TokenExplorer mode, which can be used to check the token dictionary for peculiarities
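
To make the idea of the modes above more concrete, here is a minimal Python sketch of the general workflow: compute simple text statistics, split raw text into typed chunks stored as JSON, and join the chunks back into training text with a separator chosen by chunk type. This is not FinetuneReFormatter's own code, and this page does not specify the ChunkFile schema. The function names (`text_stats`, `to_chunks`, `combine_chunks`) and the `"text"` and `"type"` keys are assumptions made purely for illustration.

```python
import json
from collections import Counter


def text_stats(text: str) -> dict:
    """Return simple whitespace-based statistics for a raw text."""
    words = text.split()
    return {
        "characters": len(text),
        "words": len(words),
        "most_common_words": Counter(w.lower() for w in words).most_common(3),
    }


def to_chunks(text: str, chunk_type: str = "generic") -> list:
    """Split text on blank lines into typed chunks.

    The {"text": ..., "type": ...} layout is hypothetical, not the real ChunkFile schema.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [{"text": p, "type": chunk_type} for p in paragraphs]


def combine_chunks(chunks: list, separators: dict, default: str = "\n") -> str:
    """Join chunks into one training text, appending a separator picked by chunk type."""
    parts = []
    for chunk in chunks:
        parts.append(chunk["text"] + separators.get(chunk["type"], default))
    return "".join(parts)


raw = "First paragraph of scraped text.\n\nSecond paragraph, same source."
stats = text_stats(raw)
chunks = to_chunks(raw)

# Round-trip through JSON, as a chunk file saved to disk would be.
serialized = json.dumps(chunks, indent=2)
combined = combine_chunks(json.loads(serialized), {"generic": "\n"})
```

Keeping the chunk type next to the chunk text is what makes type-dependent batch formatting possible at combine time: the separator table can change without touching the stored chunks.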