...or how far you can take a measly 51 steps of training material and how I found out Euterpe has a thing for trolls...

The TL;DR

For generator modules with little training data, you might have to go quite a bit beyond 100% to get acceptable results.
Don't rely on the loss graph or loss value to decide whether you used the right amount of steps.
For lower step amounts, adding more training steps rapidly lowers the chance of getting garbage output. After a while the effect weakens, but more training steps might still improve the quality of individual outputs.
Overtraining does at some point lead to the model "obsessing" on certain topic areas. However, the aspects the model ends up fixating on seem to be an interaction of the module and the information already in the model (base training & fine-tune). The fetish you'll end up with can therefore not be predicted beforehand from the training data alone and needs careful observation.

Method

Training material

Material from Forbidden Lands, a Swedish open world dark fantasy TTRPG. Trained on the 42 random encounters from the gamemaster's guide. See here for the actual module and details. In general, the training material consisted of short snippets of encounters with a fairly similar structure.

Excerpt from training data

POTENTIAL SPOILER WARNING: Don't click if you are playing in a Forbidden Lands campaign!

***
[ Category: Characters, Humans, Ambush, Lure, Hostile ]
[ Biome: Plains, Forest, Hills, Mountains, Quagmire ]
DESCRIPTION: A terrible scene unfolds in front of you. A dozen humans are crawling around with their innards spilled out, some still alive, soaked in blood. A cart with merchandise is overturned. A few people are moaning, others screaming. A woman in chainmail is trying to lift a sword. When a comparatively healthy young man sees you, he screams hysterically.
"Robbers! They took our horses. Do you have water? Is anyone a healer? No, brother, don't fall asleep!"
GM NOTES: The scene is, in fact, completely staged and no one is hurt. A group of slavers have poured blood and intestines from animals on to and around themselves, hoping that the adventurers will approach without their weapons drawn. The leader, Harwa, has a bronze horn and when he blows on it, the entire group is meant to immediately leap up and attack their "saviors." Their goal is not to kill, but to put the adventurers out of action using cudgels, and then sell them as slaves, possibly at Grindbone village (see the Raven's Purge campaign book). If anyone manages to flee, the slavers will not pursue them, instead just proceed to secure those they have caught.
***

Tested Training Steps & Sample Size:

Steps	125	250	375	500	625	1k	2k
Percentage of total steps to train	245%	487%	732%	976%	1,225%	1,961%	3,922%
Sample Size (rated outputs)	50	51	55	55	51	50	50

Measure

On a fully empty prompt, a single "encounter" was generated or output was produced until the page was filled. The output was then rated on how useful it would be as a GMing tool. It did not have to be perfectly usable right away, just giving interesting ideas to use for an encounter would be fine.

Rating	Meaning
1	Garbage: Incoherent, not an actual encounter, etc.
2	Poor: Almost garbage, but maybe some hints at usable ideas and at least some encounter structure
3	Meh: Some base ideas could be reused from the output with some revision
4	Good: Has some good ideas, could be used in a campaign after some work on the encounter
5	Great: Has cool ideas that could be used in a campaign with just a little bit of tweaking

For reference, I would consider this output garbage:

I had hoped that we'd never meet again. You should've let that goblin go. Goblins have been living under my ground for years now. Halflings, Orcs, Humans, Elves. Goblins only.
I can provide more than enough food for Trolls who've been on a journey without food for days. Orcs, Halflings or Humans have all brought me something, and Elves tend to go to the Elf Village, so I can only provide food if I know they're coming. Halflings, Elves or Orcs who don't live in Trolls territory have to pay me, Dwarves. Humans, Goblins and Ogres are always free.
It's like the Elves have said for years. Trolls kill Goblins. Humans, Halflings Elves Orcs Dwarves can go where they want. Ogres, Humans, Halflings or Humans are the one's who have to pay.
You're Undead. Trolls don't eat Undead. Goblins don't eat Humans Orcs Dwarves Ogres Elves Halflings or Humans don't eat Undead.

...and this great:

[ Category: Characters, Female Monsters ]
[ Biome: Quagmire, Forest, Marshlands, Mountains, Ruins ][ Category: Neutral Places, Quests, Encounters, Ruins ]
DESCRIPTION: An immense house has been constructed out of the bones and flesh of a dozen people. The building is surrounded by two large lakes filled with blood. Occasionally, a small figure can be seen in one of the windows or through the window panes near to where the water boils violently.
GM NOTES: This house is built on top of the remains of a village that once lived here. In the center stands an enormous skeleton covered with ragged clothing. It seems as though it is trying to reach something at its feet, but it cannot move because of its size. A few more skeletons lie scattered around the house and are wearing clothes. The house itself is not very stable and the bones of many people have broken off from it. Sometimes the house shakes when someone walks across it and other times it falls apart completely.
The home was created by a powerful undead sorceress named Rulorina, who used her magic to force the villagers into servitude. She forced them to build this house for her own use. Unfortunately, she also had a taste for human flesh and decided to feast upon the villagers' victims after they were dead, thus creating the lakes of blood. Once the bodies were drained of their blood, she used magic to bind the corpses together and create this house. When Rulorina died, the house collapsed and fell down on the village below, burying it under tons of rubble and bones.

Data

Data can be found here.

Findings

If you are interested in the full statistical analysis (you nerd!), have a look at this Jupyter notebook.

Otherwise, just read on for the summary down below.

Note that the findings are for a very specific generator module! They very likely don't apply to story modules. And until we have more similar studies, it is even unclear how much of this applies to other generator modules.

First Look

The scores per step amount were distributed like this:

So 125 steps (245%) produced garbage, basically. However, output improves pretty quickly after that; 375 steps is already quite alright. Higher step amounts seem to do somewhat better still.

So, first lesson: For generator modules with little training data you might have to go quite a bit beyond 100% to get acceptable results.

If you were wondering whether the loss graph could be used to see where output starts to become usable...

The answer is no:

I don't see a clear cut-off point - do you?

Model That Shit!

I ran two statistical models to see what output quality we should expect per training steps amount based on the data we have:

The first model predicted the chance for poor or worse output based on step amount
The second model predicted the chance for good or better output based on step amount
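
As a rough illustration of what these two models are fitted to, each rating can be collapsed into two yes/no outcomes: "poor or worse" (rating 2 or lower) and "good or better" (rating 4 or higher). The sketch below only computes the raw share of each outcome per step amount. It is not the model from the notebook, `outcome_shares` is a hypothetical helper, and the ratings in `example_ratings` are made up for illustration.

```python
def outcome_shares(ratings_by_steps):
    """Return {steps: (share_poor_or_worse, share_good_or_better)}.

    ratings_by_steps maps a step amount to a non-empty list of 1-5 ratings.
    """
    shares = {}
    for steps, ratings in ratings_by_steps.items():
        n = len(ratings)
        poor = sum(1 for r in ratings if r <= 2)  # ratings 1 (Garbage) and 2 (Poor)
        good = sum(1 for r in ratings if r >= 4)  # ratings 4 (Good) and 5 (Great)
        shares[steps] = (poor / n, good / n)
    return shares


# Hypothetical ratings for illustration only, NOT the study data.
example_ratings = {
    125: [1, 1, 2, 3, 1],
    1000: [3, 4, 5, 2, 4],
}

for steps, (poor, good) in sorted(outcome_shares(example_ratings).items()):
    print(f"{steps} steps: {poor:.0%} poor or worse, {good:.0%} good or better")
```

A rating of 3 (Meh) counts toward neither outcome, which is why the two shares do not have to add up to 100%.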

Results can be seen in this graph:

The chance to get a bad output starts pretty high at low step amounts, but decreases quickly until about 750 steps. After that, increasing training steps does not do much to your chance to get a poor output. (Don't read too much into the uptick after 1k steps. I have no data for step amounts between 1k and 2k, so the curve won't be very precise in that area.)

The chance to get a good output simply increases slowly and steadily the more training steps you add. I suspect that this should actually level off after about 1k steps, but I don't have much data after 1k steps, so the model might not detect that. The main point, however, is that at least until 2k steps (3,922%) adding more steps doesn't seem to hurt the quality of individual outputs and might even continue to improve it.

So if adding more steps did not seem to hurt and at best even continued to improve output... We might want to use 2k steps, or even try and see if output still improves if we add more steps, right?

Well, not so fast young padawan...

All This Training and (Almost) no Data Makes Euterpe go Trololololololololololo

The problem is, Euterpe has a troll fetish. No really!

While the outputs from the 2k steps module are alright on their own, once you have seen only a handful of them you begin to notice a lack of variety.

Here are 5 random retries on empty prompt, generated with the 2k module:

Click here to be trolled...

[ Category: Characters, Trolls ]
[ Biome: Mountains ]
DESCRIPTION: A group of trolls are walking along the path. They carry heavy bags on their backs and have axes slung over their shoulders.

[ Category: Characters, Trolls ]
[ Biome: Mountains ]
DESCRIPTION: The sound of heavy footsteps is followed by a deep rumbling. It sounds like an avalanche in the distance.

[ Category: Characters, Trolls ]
[ Biome: Mountains ]
DESCRIPTION: You hear a strange sound and look up to see the head of a troll rolling down from above. In his hand he holds a large rock that he throws at you.

[ Category: Characters, Humans ]
[ Biome: Dark Forest ]
DESCRIPTION: The sound of a flute is heard as you pass by. A young man with long hair and wearing rags sits on the ground.

[ Category: Characters, Elves ]
[ Biome: Mountains ]
DESCRIPTION: You hear the sound of a flute. The melody is strangely sad and melancholy. And then you see them.

Noticed a pattern?

3/5 outputs contain trolls. That is actually somewhat worse than what you would get on average, but the amount of trolls in the output is still very noticeable with the 2k module.
4/5 outputs start with you hearing some kind of sound.

And no, these kinds of obvious patterns do not occur at 1k steps or below. It seems that at 2k, Euterpe is obsessing on a few specific topics.
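
The two counts can be double-checked with a simple keyword search over the five outputs quoted above. This is only a sketch: `count_containing` is a hypothetical helper, and matching the word "sound" is a crude stand-in for "starts with you hearing something".

```python
# The five 2k-module outputs quoted above, one string per output.
outputs = [
    "[ Category: Characters, Trolls ] [ Biome: Mountains ] DESCRIPTION: A group of trolls are walking along the path. They carry heavy bags on their backs and have axes slung over their shoulders.",
    "[ Category: Characters, Trolls ] [ Biome: Mountains ] DESCRIPTION: The sound of heavy footsteps is followed by a deep rumbling. It sounds like an avalanche in the distance.",
    "[ Category: Characters, Trolls ] [ Biome: Mountains ] DESCRIPTION: You hear a strange sound and look up to see the head of a troll rolling down from above. In his hand he holds a large rock that he throws at you.",
    "[ Category: Characters, Humans ] [ Biome: Dark Forest ] DESCRIPTION: The sound of a flute is heard as you pass by. A young man with long hair and wearing rags sits on the ground.",
    "[ Category: Characters, Elves ] [ Biome: Mountains ] DESCRIPTION: You hear the sound of a flute. The melody is strangely sad and melancholy. And then you see them.",
]


def count_containing(texts, keyword):
    """Count the texts that contain keyword, ignoring case."""
    needle = keyword.lower()
    return sum(1 for text in texts if needle in text.lower())


troll_count = count_containing(outputs, "troll")
sound_count = count_containing(outputs, "sound")
print(f"{troll_count}/{len(outputs)} outputs mention trolls")
print(f"{sound_count}/{len(outputs)} outputs mention a sound")
```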

How Many Trolls are we Talking?

The pattern with the sounds is a bit more tricky to quantify, so we will focus on the trolls from here on. Thanks to NAI's awesome Token Probability Viewer the troll invasion is fairly easy to quantify:

Start the prompt with [ Category: Characters,
Do one generation
Then use the Token Probability Viewer to see the probabilities for the first generated tokens. (Using the "Before" probabilities - these are influenced by the module, but not by generation settings)

We can do the same for the other step sizes to see how this develops.
(Note: For smaller step sizes, Troll is sometimes not among the top 10 tokens. However, tokens are sorted based on the "After" column, so I just used a bias for Troll to ensure the probability for this token is always displayed.):

Obligatory meme insert

Okay so why all the Trolls?

"Well", I hear you say, "clearly you overtrained with 2k steps and now the module is just copying the patterns it sees in the training data. Probably your training data just had a lot of passages with the token Troll."

Thank you dear reader for that thoughtful comment I put into your mouth. That would indeed be a very good theory, except that it does not hold up.

The training data had 41 meta-tagged encounters. Of those, 28 started with the meta-tag [ Category: Characters,.

Of those 28 meta-tags, 3 actually continue with [ Category: Characters, Trolls. There are no more occurrences of the token Troll btw (only of troll and trolls, which are different tokens). So if Euterpe was just copying the patterns it saw in the training data, the expected probability of the token Troll should be 3/28 = 10.71%.
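
The same baseline can be computed by counting how often a given continuation follows the prefix in the training tags. A minimal sketch: `continuation_share` is a hypothetical helper, and the tag list is a placeholder built only to match the counts stated above (28 tags starting with the prefix, 3 of them continuing with Trolls), not the real training data.

```python
def continuation_share(tag_lines, prefix, continuation):
    """Share of lines starting with prefix whose next text starts with continuation."""
    matching = [line for line in tag_lines if line.startswith(prefix)]
    if not matching:
        return 0.0
    hits = sum(
        1 for line in matching
        if line[len(prefix):].lstrip().startswith(continuation)
    )
    return hits / len(matching)


# Placeholder tags that only reproduce the counts from the text (3 of 28).
# The non-troll categories are stand-ins, not the actual training tags.
tags = (
    ["[ Category: Characters, Trolls ]"] * 3
    + ["[ Category: Characters, Humans ]"] * 25
    + ["[ Category: Places ]"] * 13
)

baseline = continuation_share(tags, "[ Category: Characters,", "Troll")
print(f"Expected probability if the module only copied the data: {baseline:.2%}")
```

Under these assumptions the result is the 3/28 baseline of roughly 10.71%.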

You might have noticed that the 2k module overshoots that by quite a bit. So, important lesson here: Overtraining does at some point lead to the model "obsessing" on certain topic areas. However, the aspects the model ends up fixating on seem to be an interaction of the module and the information already in the model (base training & fine-tune). The fetish you'll end up with can therefore not be predicted beforehand from the training data alone and needs careful observation.

(Anecdotally, I observed a similar pattern when I experimented with this data on Sigurd. Only that Sigurd kept generating kobolds for some reason. The training data has 0 kobolds...)

The Other Denizens of the Realm

We can use the method developed above to see how the token probabilities develop for other selected sets of tokens.
(Note: Here I again used a bias for the selected tokens to make sure that the token probability is displayed for all tokens.)

Based on this data I finally settled on the 1k module. It seems to strike a good balance between getting the type of output I want from this module and staying far enough away from overtraining and Euterpe going mad with some weird troll fetish.

Module Study 1: Forbidden Lands Encounter Module - TravelingRobot/NAI_Community_Research GitHub Wiki
