Possible sites / things to fine-tune the AI with - VBPXKSMI/Wiki-Test GitHub Wiki
================================================================
MOST IMPORTANT LINKS
================================================================
Vanilla GPT-2 models plus two CYOA datasets (the original and the lewd one posted by an Anon), and the base-game AI Dungeon model_v5 you get from the torrent, just as a backup.
Some of the models made by Anons that were converted to PyTorch.
================================================================
SECONDARY LINKS
================================================================
Possible sites / things to fine-tune the AI with:
Porn related:
Here is 280 MB (~60M words) of erotic fanfiction that can be fed as training/fine-tuning data; cleaned, formatted and tagged to the best of my abilities.
Models trained on HTTYD and Spyro fanfics, some weird smut with dragons, some beast stuff, and Overlord:
Dragon model v1.0
Recommendations: Set timeout to 240 or 380. Do not use word slit. It'll hang for a long ass time and break. Just trust me. That model wasn't trained that long with 1558M, so it'll either shit out gold or give you cancer. It was trained for only 250 steps with a batch size of 4. Have fun and don't forget to make backups of your original models.
Dragon model v2.0
It might work better for some, but worse for others. I know my older one was definitely better at doing what I want.
5MB of curated and formatted smut CYOA, a mix of fantasy and modern with many different fetishes. Taken from CYOA sex stories. Lots of non-sexual events, battles, fights, magic, world building, and other nonsense mixed in with sexual fun. 90% fantasy and 10% modern. Might make a good addition to someone's smut training file.
================================================================
Other sites / non porn related things:
Almost 500MB of (nearly entirely) properly formatted CYOA data, including a fair amount of lewd stuff: all the stories from http://editthis.info/create_your_own_story/ dumped into a single .txt (this data is unusable due to bad formatting).
Example of a formatted text an Anon made (not proofread): https://pastebin.com/tFh0jhpX
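A rough sketch of how a folder of individual story files could be merged into one training .txt. It assumes each story lives in its own UTF-8 .txt file and that the fine-tuning script treats GPT-2's `<|endoftext|>` token as the document separator; neither assumption comes from the data sets above, and the function names are made up for illustration.

```python
from pathlib import Path

# GPT-2's end-of-text token; using it as the separator between stories is an
# assumption about the fine-tuning script, not something the data sets specify.
END_OF_TEXT = "<|endoftext|>"


def build_corpus(stories):
    """Join stories into one training string, skipping empty entries."""
    cleaned = [s.strip() for s in stories if s.strip()]
    return ("\n" + END_OF_TEXT + "\n").join(cleaned) + "\n"


def merge_story_files(src_dir, out_file):
    """Merge every .txt file in src_dir (hypothetical layout) into out_file.

    Returns the number of files read.
    """
    paths = sorted(Path(src_dir).glob("*.txt"))
    stories = [p.read_text(encoding="utf-8", errors="replace") for p in paths]
    Path(out_file).write_text(build_corpus(stories), encoding="utf-8")
    return len(paths)
```

Something like `merge_story_files("stories", "train.txt")` would then produce a single file to feed to the trainer; both paths are placeholders.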
================================================================
IDEAS, SITES, NON-DIRECT DOWNLOAD LINKS AND OTHER RANDOM THINGS COMPLETELY UNSORTED
================================================================
Possible sites / things to fine-tune the AI with:
Porn related:
Back when I was into text adventures, I remember there were groups for "Adult Interactive Fiction" (AIF). I wonder if the text from these games can be extracted and fed into the algorithm: http://www.ifwiki.org/index.php/Category:Porn/Erotica
asstr.org
https://editthis.info/create_your_own_story/All_Adult_Stories
Flexible Survival
Grey Archive
Marjorie's Bedtime Stories
Miranda scenes from Fall of Eden
================================================================
Other sites / non porn related things:
Look into "Elliquiy". Kind of a pain in the ass to get into, since it requires an application. Once in, there are many years' worth of pretty hardcore forum-based roleplaying and collaborative writing. Much of it is very dirty and a lot of it is based in a variety of fandoms. The catch is that it's all in forum posts and not properly archived. But if I could pick one source to train the AI on, it would be Elliquiy, just for the sheer amount of stuff it has.
I found this Reddit post from someone who apparently archived 400GB of fanfiction.net stories: https://www.reddit.com/r/FanFiction/comments/3jlene/i_archived_nearly_every_story_on_ffnet/ Training on this would make the AI an expert at popular fictional universes. I imagine most of it is pretty low quality though. Got to weed out the shit. Half of it is literally shit that is best thrown in the trash.
Logs from /vg/station SS13 server
A web archive of the BIONICLE legend, compiled and presented in an easy-to-read format.
Easy literature database to scrape: gutenberg.org
Lord of the Mysteries
There is a good CYOA series called "Endless Quest".
Pathfinder books.
About "gamerules" being incorporated into the training data, this site could be useful: http://textfiles.com/rpg/
A novelization of Planescape: Torment (written in first person, unfortunately).
I think Anais Nin and Bataille both wrote second-person stories.
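For the gutenberg.org suggestion above: plain-text Project Gutenberg files wrap each book in license boilerplate, which you probably don't want in the training data. A minimal sketch of stripping it, assuming the usual `*** START OF THE/THIS PROJECT GUTENBERG EBOOK ... ***` and `*** END OF ...` marker lines (the exact wording varies between files, so treat the patterns as a starting point; the function name is made up):

```python
import re

# Marker lines that usually bracket the actual book text. The wording differs
# between older and newer files, so these patterns are a best guess.
START_RE = re.compile(
    r"^\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)
END_RE = re.compile(
    r"^\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)


def strip_gutenberg_boilerplate(raw):
    """Return the text between the START and END markers.

    If a marker is missing, fall back to the start or end of the file.
    """
    start = START_RE.search(raw)
    end = END_RE.search(raw)
    begin = start.end() if start else 0
    finish = end.start() if end else len(raw)
    return raw[begin:finish].strip()
```

Files that lack the markers pass through unchanged apart from whitespace trimming, so it is worth spot-checking a few outputs by hand.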
> Idea (assuming the stories are of high quality):
(1) Convert all stories made by Anons into training data (could use regex to find common misspellings and fix them)
(2) Retrain models on Anons' stories
(3) ????
(4) PROFIT
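The regex cleanup from step (1) could look roughly like this. The misspelling table is a made-up example, not a list anyone here compiled; you would fill it with whatever errors actually show up in the stories.

```python
import re

# Hypothetical misspelling table (illustrative only); extend it with the
# errors that actually appear in the collected stories.
COMMON_FIXES = {
    "teh": "the",
    "recieve": "receive",
    "definately": "definitely",
    "alot": "a lot",
}

# One pattern matching any listed misspelling as a whole word, case-insensitively.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(word) for word in COMMON_FIXES) + r")\b",
    flags=re.IGNORECASE,
)


def _replace(match):
    word = match.group(0)
    fixed = COMMON_FIXES[word.lower()]
    # Keep a leading capital so sentence starts stay capitalized.
    if word[0].isupper():
        fixed = fixed[0].upper() + fixed[1:]
    return fixed


def fix_misspellings(text):
    """Replace known misspellings, leaving all other text untouched."""
    return _PATTERN.sub(_replace, text)
```

The `\b` word boundaries keep it from mangling longer words that merely contain a misspelling (for example "Tehran" is left alone).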
> List of 2nd-person narrative books/stories that could be formatted:
(1) Junot Díaz, "Miss Lora"
(2) Lorrie Moore, Self-Help
(3) David Foster Wallace, "Forever Overhead"
(4) Édouard Levé, Suicide
(5) Ron Butlin, The Sound of My Voice
(6) Andy Weir, "The Egg"
(7) Mohsin Hamid, How to Get Filthy Rich in Rising Asia (could potentially work)
(8) Italo Calvino, If on a winter's night a traveler (story with lots of 4th wall breaking)
Tails Gets Trolled
Slablands
Pessoa's "Book of Disquiet"
> Works from:
(1) Nabokov
(2) Michael Moorcock
(3) Stanisław Lem
(4) Roger Zelazny
(5) Robert E. Howard
(6) Piers Anthony's The Incarnations of Immortality series
For scraping Reddit: the r/SubSimulatorGPT2 guys did just that, and there's also a big archive of millions of Reddit posts (I think they used it for training) that can be downloaded as a ".zip".