Study 1: Building the AI Story Scale

Table of contents

Goal
The TL;DR
    The AI Story Scale
    Impact of Presets
    Speculations on Generation Settings
Method
Results Summary
    Story Aspects
    Differences Between Presets
    Which Prompts are Easy or Hard for Euterpe?
    Are NAI Users “Different”?
That's all, folks (for now)!

Goal

To build a solid scale for human ratings of AI-generated stories, according to best psychometric practices. Why? Because AI research does not have one yet.

Because I am uncreative, I call the resulting questionnaire the AI Story Scale, AISS for short.

Click here for details on the motivation behind the study

Why do we need this anyway?

In the end, what we all truly care about when using NAI is getting coherent, well-written, creative stories. This is the reason we tinker with generation settings, author's note, biases, etc… But without a good measure of story quality, it is hard (if not impossible) to judge if and how these things really affect the output. Does increasing top-p sampling by .05 impact coherence or creativity? Does it depend on the randomness setting? Or do all these parameters actually do nothing at all, and are we just debating placebo effects? If we had a good, well-constructed measure for human ratings, we could run actual studies on these things and piece together how all these settings influence the story on a grand scale.

“But AI researchers already use human ratings for these things”

The tl;dr is: AI researchers are extremely intelligent people, and papers on language models are usually way over my head. However, they are not psychometricians, and so their way of rating story quality is… often not great. Getting usable data from human raters requires you to understand how raters approach stories, and to build your scale on real data.

Here is the thing – measuring a complex concept like “fluency” or “interestingness” with a single question is terrible. If you are really interested in a longer discussion of this topic, see this article, for example.

But in a nutshell: The measures commonly used by AI researchers for rating stories frequently have serious weaknesses and would probably not pass review in fields focussed on measuring human attitudes and opinions. Even if you happen to ask well-written questions (some are not, though), results will still be unreliable and have low predictive power. That is just the fate of measuring complex, multi-faceted concepts with a single question. Single-item measures are also terrible for many types of statistical analyses (even just reporting things like the mean or a correlation is dubious). The fact that none of the commonly used questions went through any preliminary testing to see if they work well probably makes the whole issue even worse.

Furthermore, just because your theory says that the deciding aspects of AI-generated stories are fluency, relevancy, coherence or whatever, does not mean that the average human actually considers these categories when rating stories. The constructs used by humans to assess stories might have a very different structure than your theory. Which is why you need to check your scale before you wreck your scale! Otherwise, you might think you are measuring one thing, but you are really just measuring a bunch of nonsense (called low validity). Or you are measuring the thing you want to measure, but you picked questions that are so imprecise as to be basically worthless (called low reliability).
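If you want to see the reliability argument in action, here is a tiny simulation (purely illustrative, not data from this study) of why averaging several noisy questions tracks an underlying perception much better than any single question:

```python
# Illustrative simulation: single question vs. 5-question scale.
# All numbers are made up; "trait" stands for a rater's true perception.
import numpy as np

rng = np.random.default_rng(42)
n_raters = 10_000
trait = rng.normal(0, 1, n_raters)  # latent perception (e.g., coherence)

def ask_question():
    # One observed rating = true perception + measurement noise.
    return trait + rng.normal(0, 1, n_raters)

single_item = ask_question()
five_item_scale = np.mean([ask_question() for _ in range(5)], axis=0)

print(np.corrcoef(trait, single_item)[0, 1])      # ~ .71
print(np.corrcoef(trait, five_item_scale)[0, 1])  # ~ .91
```

A single noisy question correlates with the “true” perception at about .71; the five-question average at about .91. That is the whole multi-item argument in two numbers.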

[Image: loss graph – © Nevit Dilmen]
Hey, look, you already learned the basics of psychometrics!

I regularly advise people on scale construction in my job, and I have never seen a purely theoretical scale survive real-world data. Humans are complicated… go figure.

The TL;DR

The AI Story Scale

  • This study revealed that ratings of AI-generated stories on the 73 included questions can be summarized with 5 story aspects: coherence, creativity, avoiding repetition, pace, and consistent characterization
  • Based on the data gathered in this study, I distilled the 22 most useful questions for measuring these story aspects with precision and efficiency. The result is the AI Story Scale (AISS)

Impact of Presets

  • I used the data on those story aspects to compare 8 of the 10 available presets for Euterpe (Moonlit and ProWriter could not be included in this study). Based on this analysis, I would make the following recommendations:
    • Note that these recommendations are for long, uninterrupted chains of story output with little user input. It is currently unclear if this advice also applies to users that steer the AI more heavily!
    • Avoid Genesis - it is less coherent than other choices with no known upside
    • Use Ace of Spades as your default preset. It avoids repetition better than the other tested presets and has no known downside.
    • If you are willing to spam retry, presets with low consistency can give you two more potential options:
      • If you want superior coherence, stay up retrying with All-Nighter: You will have to sort out more incoherent outputs than usual, but you will also get more strongly coherent ones than usual
      • If you are looking for fast-paced stories, ride the retry button with Low Rider: You will get more outputs with low pace, but in exchange you will also see more fast-paced ones

Speculations on Generation Settings

  • An inspection of the settings of the presets gives some first ideas on their potential impact
    • The over-performance of Ace of Spades on avoiding repetition suggests that high values of repetition penalty slope are effective at controlling repetition. It might be worth experimenting more with a strong repetition penalty slope.
    • The behavior of All-Nighter suggests that high randomness does not affect average coherence, but rather its variance. High randomness might lead to less consistency in coherence: more outputs will have weak coherence, in exchange for more outputs with strong coherence.

Method

For this study, 324 people each rated one randomly selected story from a set of 320 stories on various questions.

(If you were one of those raters – thank you so much! You made this possible!)

Everyone who took part in the study first read a story generated with Euterpe (30 generations – approx. 5 minutes of reading time) and then rated the story on 73 questions.

Click here for details on the methods

Question Set

I drafted a set of 73 potential questions. That is way more than we need, but from experience, I expected to throw away 50% of the questions or more because they would turn out not to be useful for various reasons. That is part of the point of the study – figuring out which questions actually work by looking at how people actually approach them.

Questions were taken from existing research, drafted by me or… Sigurd (pro-tip for behavioral researchers: NAI is great for drafting scales!). I categorized them into 7 theoretical aspects that made sense to me at the time: Cohesion, consistent characterizations, creativity, general story quality, repetitiveness, style, pacing.

Click here for more details on the questions

Overview Question Set

| Theorized Story Aspect | Number of Questions | Source(s) | Number of Questions Kept for AISS v1 |
| --- | --- | --- | --- |
| Cohesion | 12 | Narrative Engagement Scale (2), Tambwekar et al., 2019 (8), own (2) | 4 (all from Tambwekar et al., 2019) |
| Consistent Characterization | 8 | Own | 4 |
| Creativity | 12 | Inspired by DeLucia, Mueller & Li, 2021 (2), own (10) | 3 (all own) |
| General Story Quality | 8 | Purdy et al., 2018 (2), inspired by DeLucia, Mueller & Li, 2021 (3), own (3) | 2 (1 from Purdy et al., 2018, 1 own) |
| Repetitiveness | 12 | Purdy et al., 2018 (1), inspired by DeLucia, Mueller & Li, 2021 (1), own (10) | 4 (1 from Purdy et al., 2018, 1 from DeLucia, Mueller & Li, 2021, 2 own) |
| Style | 12 | Purdy et al., 2018 (1), inspired by DeLucia, Mueller & Li, 2021 (1), own (10) | 1 (own) |
| Pace | 9 | Own | 4 |

Note: The final story aspect of a kept question did not always agree with its theorized story aspect.

Raters

I recruited people for rating the stories in two ways (roughly 50/50):

  • Asking people on the NAI and AIM discord as well as the NAI Reddit to participate
  • Recruiting participants from panels for academic research (SurveySwap.io and SurveyCircle.com)

Story Set

Using nrt, stories were generated by giving Euterpe a short prompt to establish the genre (High Fantasy, Hard Sci-Fi, Historical Romance, or Horror). Euterpe then ran through 30 generations with one of 8 default presets. This included every default preset except ProWriter (it was not yet a default preset when I sampled the stories) and Moonlit Chronicler (nrt did not support top-A sampling yet). The result was 320 stories (40 per preset), each taking about 5 minutes to read through.
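As a rough sketch of the design (hypothetical code, not the actual nrt setup – nrt is driven by config files, and generate() below merely stands in for one NovelAI API call):

```python
# Hypothetical sketch of the generation design. generate() stands in for one
# 40-token Euterpe generation with the given preset; it is not a real API.
PRESETS = ["Ace of Spades", "All-Nighter", "Basic Coherence", "Fandango",
           "Genesis", "Low Rider", "Morpho", "Ouroboros"]
GENRES = ["Hard Sci-Fi", "High Fantasy", "Historical Romance", "Horror"]
MEMORY = {g: f"[ Author: ; Tags: ; Genre: {g} ]" for g in GENRES}
PROMPT = {g: "..." for g in GENRES}  # full prompt texts are in the table below

def generate(context: str, memory: str, preset: str) -> str:
    raise NotImplementedError  # placeholder for the actual NovelAI call

stories = []
for preset in PRESETS:
    for genre in GENRES:
        for _ in range(10):              # 10 stories per preset/genre cell
            text = PROMPT[genre]
            for _ in range(30):          # 30 uninterrupted generations
                text += generate(text, MEMORY[genre], preset)
            stories.append({"preset": preset, "genre": genre, "text": text})

# 8 presets x 4 genres x 10 stories = 320 stories (40 per preset)
```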

Click here for more details on the stories

Memory/Prompt Pairs Used to Generate the Data

| Label | Memory | Prompt |
| --- | --- | --- |
| Hard Sci-Fi | [ Author: ; Tags: ; Genre: Hard Sci-fi ] | "I have a message for you from the president," said Dr. Sato, handing over an envelope to me. "He's asking that we meet with him at his office this afternoon." I took it and thanked her before walking out of my apartment building into the bright sun. It was already noon on Mars—the longest day in the year here on the planet. |
| High Fantasy | [ Author: ; Tags: ; Genre: High Fantasy ] | The sun was high in the sky when they arrived at their destination. The valley of the River Tethys flowed into a wide, shallow lake surrounded by mountains on all sides. A small village sat along its shores with two towers standing guard over it from either side like sentinels. It looked to be deserted but for some smoke rising up out of chimneys and the occasional bird flying overhead or flitting through trees. |
| Historical Romance | [ Author: ; Tags: ; Genre: Historical Romance ] | The first time he saw her, the sight of her was like a slap across his face. She'd come into the tavern where he worked and sat at one of the tables in front of him. |
| Horror | [ Author: ; Tags: ; Genre: Horror ] | I woke up to hear knocking on glass. At first, I thought it was the window until I heard it come from the mirror again. I got out of bed and walked over to the mirror. When I looked into it, there was a face looking back at me. |

Specifications of the Generation Settings

| Preset | Temperature | max_length | min_length | top_k | top_p | top_a | tail_free_sampling | repetition_penalty | repetition_penalty_range | repetition_penalty_slope | repetition_penalty_frequency | repetition_penalty_presence | Order |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Ace of Spades (14/02/2022) | 1.15 | 40 | 1 | 0 | 0.95 | 1 | 0.8 | 2.75 | 2048 | 7.02 | 0 | 0 | TFS, Top-p, Top-k, Temperature |
| All-Nighter (14/02/2022) | 1.33 | 40 | 1 | 13 | 1 | 1 | 0.836 | 2.366 | 400 | 0.33 | 0.01 | 0 | TFS, Top-p, Top-k, Temperature |
| Basic Coherence (14/02/2022) | 0.585 | 40 | 1 | 0 | 1 | 1 | 0.87 | 3.05 | 2048 | 0.33 | 0 | 0 | Temperature, Top-k, Top-p, TFS |
| Fandango (14/02/2022) | 0.86 | 40 | 1 | 20 | 0.95 | 1 | 1 | 2.25 | 2048 | 0.09 | 0 | 0 | Top-p, Top-k, TFS, Temperature |
| Genesis (14/02/2022) | 0.63 | 40 | 1 | 0 | 0.975 | 1 | 0.975 | 2.975 | 2048 | 0.09 | 0 | 0 | Top-p, Top-k, TFS, Temperature |
| Low Rider (14/02/2022) | 0.94 | 40 | 1 | 12 | 1 | 1 | 0.94 | 2.66 | 2048 | 0.18 | 0.013 | 0 | Top-p, Top-k, TFS, Temperature |
| Morpho (14/02/2022) | 0.6889 | 40 | 1 | 0 | 1 | 1 | 1 | 1 | 2048 | 0 | 0.1 | 0 | Temperature, Top-k, Top-p, TFS |
| Ouroboros (14/02/2022) | 1.07 | 40 | 1 | 264 | 1 | 1 | 0.925 | 2.165 | 404 | 0.84 | 0 | 0 | Top-k, Temperature, TFS, Top-p |

The full dataset of all stories can be downloaded here.

Results Summary

Note: In some places, I added comments on the stats or statistical procedure in italics in parentheses (like this). They are there for people with a stats education, but feel free to just ignore them!

Story Aspects

Using statistical analysis, I extracted the factors that are sufficient to explain the answers to the questions. The result was 5 story aspects. In other words, the given ratings can best be explained with 5 main story aspects. Those were:

  • Coherence: Does the story feel plausible and coherent, and does it have a clear thread? (higher = more coherent)
  • Creativity: Does the story feel innovative, original and exciting? Also includes perceptions of quality and enjoyability as well as complexity of writing. (higher = more creative)
  • Avoiding Repetition: Does the story avoid repetitions of elements? (higher = less repetitive)
  • Pace: Does nothing ever happen, or do things happen at a fast pace? (higher = faster pace)
  • Consistent Characterization: Are characters described consistently with no contradicting descriptions? (higher = more consistency)

AISS-v1

After establishing the story aspects, I looked at what items would actually be useful to measure each story aspect. I only kept questions that measured one story aspect well, and only one story aspect. Many questions measured several things at once, and that is not particularly useful for us. So, I sorted these questions out. This resulted in a questionnaire that is optimized to measure the aspects listed above: The AISS-v1.

The AISS-v1 is currently the only questionnaire for rating AI stories based on empirical analysis (more precisely, based on exploratory factor analysis). It should provide a robust instrument to understand how different models and settings influence how people experience the resulting story output. The current version of the AISS can be found here.

Are you really, really interested in how I decided on these questions? Click the thing below for a layman's explanation.

Click here for more details on the scale construction

Analysis of Rating Questions

Oh, you really clicked that thing? Well, hello there, curious traveler!

The following table should hopefully give a rough idea of the steps I went through in the analysis. This is an attempt to explain the general idea for people without statistical expertise. It is not a full explanation of the statistical procedures! This is the very rough, “overly simplified and probably wrong” version. Really, I just hope no statistician is reading this…

Nonetheless, “statistical procedure” in the table is the technical description of what I actually did. It is there for the people with stats education.

(Although, if you are interested in a full report of the analysis, you may want to wait until I have done a technical write-up of the whole thing. I'll link it here when it is done.)

| Step | Explanation | Result | Statistical Procedure |
| --- | --- | --- | --- |
| 1: Sort out questions extremely similar to other question(s) | We want questions within one aspect to be reasonably similar, of course (they should measure the same thing, after all). However, problems arise when questions are extremely similar to another question or a combination of other questions (i.e., highly correlated). For one, having a question that is almost the same as other questions is pointless – we are fine keeping just one and dropping redundant items. More importantly, extremely similar questions can cause problems for the algorithm that determines the story aspects underlying the answers. | Dropped 42 redundant questions.<br>31 questions left. | Checked if the determinant of the correlation matrix was < .00001. If so, determined the item with the highest VIF and excluded it. Reran this process until the determinant was > .00001. |
| 2: Check if the data is appropriate for the analysis | We want to determine our story aspects based on analyzing which questions show common patterns. For this analysis to make sense, we need to check whether the questions actually have enough overlap to begin with. I did this check for the data on all questions as a whole, as well as for the separate questions. | Data on all questions was deemed suitable for the analysis. | Check of overall and item KMOs. Overall KMO = .86. All item KMOs > .6. |
| 3: Determine the ideal number of story aspects | We assume that there are some underlying story aspects in people's minds that they think about when answering the rating questions. But how many are there? Before we can determine which questions belong to which aspects, we first must know how many aspects we are looking at. For this, we look at stats telling us how well a certain number of story aspects would explain the answers to the questions. We are looking for the number of story aspects that explains the ratings as well as possible without entering the area of diminishing returns. In addition, we want aspects that actually make theoretical sense, so if in doubt, we prefer the number of factors that we can interpret well. I used a manual inspection of the stats as well as a check of whether a new aspect performed better than you would expect if the data were purely random. | 5 story aspects seemed to be the ideal number to explain the answers to the rating questions. | Scree plot: suggested 4 or 5 factors.<br>Parallel analysis: suggested 6 factors (it's close for #6).<br>6 factors: 2 factors unstable (2 or fewer strong loadings); after removing cross-loading items, no support for a 6th factor.<br>5 factors: stable factors with clear interpretation.<br>4 factors: stable and clearly interpretable factors, but after removing low loadings and cross-loadings, more support for the 5-factor solution.<br>→ Chose the 5-factor solution. |
| 4: Sort out questions that have very little to do with the story aspects | Just like we do not want questions that are extremely similar to others, we also do not want questions that have little to nothing to do with the rest. | All questions had reasonable overlap with each other. | Ran EFA and determined communalities per item. Remove items with communalities < .2. Rerun EFA. |
| 5: Sort out questions that don't measure any aspect well | We want questions that measure one aspect at least reasonably well. So, I sorted out all questions that do not correspond well to any story aspect. | Sorted out 5 questions.<br>26 questions left. | Ran EFA (oblimin rotation, since the factors had clear correlations). Excluded items with main loading < .4. Reran EFA. |
| 6: Sort out questions that measure more than one aspect | If a question corresponds to more than one aspect, it is not very useful to us; answers to it will be very hard to interpret. For example, if a question measures aspect 1 and 2 at the same time, is a high score on this question due to aspect 1, 2, or both? To avoid these problems, I sorted out all questions that might measure more than one aspect. | Sorted out 4 questions.<br>22 questions left. | Excluded items with a cross-loading > .3 or with (main loading − cross-loading) < .2. Reran EFA. |
| 7: Inspect the current story aspects | Time to have a closer look at the extracted story aspects. The aspects already had clear meanings based on their respective questions: creativity, coherence, repetitiveness, pace, and consistent characterization. I also examined how reliable and consistent the measurements for each aspect were. For most aspects, this already looked good. Measurement of repetitiveness was more reliable after removing one question. Measurement of pace looked a bit unreliable, and consistent characterization had only 2 questions (we want 3+ for stable measurements). | Sorted out 1 question.<br>21 questions remaining. | Inspected high-loading items and Cronbach's α per factor. Consistent characterization had only 2 items. For pace, Cronbach's α = .69. Removed one question for repetitiveness to improve internal reliability, Cronbach's α = .76. The remaining 2 factors had Cronbach's αs > .75. |
| 8: Strengthen measurement of weak aspects by adding back questions | We now have a more focussed idea of which aspects are reasonable to measure, but 2 of those are measured a bit wobbly. Maybe I was too strict in step 1 when sorting out strongly related questions. So, I tried adding back a selection of 10 questions (31 questions total) that seemed to measure pace, consistent characterization, or contradictions in general, and went through steps 1–7 with those questions again. I sorted out 6 of those questions in the process, ending up with 25 questions total. This made measurement of consistent characterization more reliable. Measurement of pace was still not as consistent as I would have liked, but it is not catastrophic by any means and could still be considered acceptable by conventional indicators. | Added 4 items back in.<br>Ended up with 25 items. | For pace, Cronbach's α = .69. Other factors: Cronbach's αs > .76. |
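For the code-minded, here is a condensed sketch of what steps 1–7 can look like in Python (my reconstruction, assuming the ratings sit in a DataFrame with one column per question – this is not the actual analysis script):

```python
import numpy as np
import pandas as pd
from factor_analyzer import FactorAnalyzer, calculate_kmo
from statsmodels.stats.outliers_influence import variance_inflation_factor

ratings = pd.read_csv("ratings.csv")  # hypothetical file, one column per question
ratings = (ratings - ratings.mean()) / ratings.std()  # standardize for VIF

# Step 1: drop the highest-VIF item until the correlation matrix is no
# longer near-singular (determinant >= .00001).
while np.linalg.det(ratings.corr().values) < 1e-5:
    vifs = [variance_inflation_factor(ratings.values, i)
            for i in range(ratings.shape[1])]
    ratings = ratings.drop(columns=ratings.columns[int(np.argmax(vifs))])

# Step 2: sampling adequacy (overall KMO should be high, item KMOs > .6).
kmo_per_item, kmo_overall = calculate_kmo(ratings)

# Steps 3-6: EFA with oblimin rotation (the factors are allowed to correlate),
# then prune items by communality, main loading, and cross-loading.
fa = FactorAnalyzer(n_factors=5, rotation="oblimin")
fa.fit(ratings)

communal_ok = fa.get_communalities() >= .2                      # step 4
loadings = pd.DataFrame(fa.loadings_, index=ratings.columns).abs()
main = loadings.max(axis=1)                                     # strongest loading
cross = loadings.apply(lambda row: row.nlargest(2).iloc[-1], axis=1)
keep = ratings.columns[communal_ok & (main >= .4)               # step 5
                       & (cross <= .3) & (main - cross >= .2)]  # step 6

# Step 7: internal consistency (Cronbach's alpha) per story aspect.
def cronbach_alpha(items: pd.DataFrame) -> float:
    k = items.shape[1]
    return k / (k - 1) * (1 - items.var(ddof=1).sum()
                          / items.sum(axis=1).var(ddof=1))
```

Note that the real procedure re-runs the EFA after every batch of removals (the “Rerun EFA” notes above); the sketch collapses that loop into a single pass.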

This way I ended up with 8 questions for coherence, 7 for creativity, 3 for consistent characterization, 4 for pace and 3 for avoiding repetition. I used all of those for the analyses below.

However, the difference in precision you get from 8 versus 6 questions should be minimal. And I would like to avoid unnecessarily long questionnaires. Guidelines for constructing questionnaires like this generally recommend 4-6 questions per domain for these reasons.

So, I capped the number of questions per aspect at 6 for future studies. The final AISS v1 thus ends up with 6 questions for coherence, 6 for creativity, 3 for consistent characterization, 4 for pace and 3 for avoiding repetition – 22 questions total. This should take about 3 minutes or less to finish, which should be fine for shorter studies.

Differences Between Presets

I saved the scores for each story aspect from the analysis above (factor scores). Since I varied the presets used to generate the stories (all Euterpe presets except Moonlit and Pro-Writer), I can use this data to look at what kind of output each preset tends to generate.
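In code terms (continuing the hypothetical factor-analysis sketch from the scale-construction details above), extracting such per-story factor scores can look like this; the aspect column names and the metadata frame are assumptions for illustration:

```python
# Factor scores per rated story, one column per story aspect.
aspects = ["coherence", "creativity", "avoid_repetition",
           "pace", "consistent_char"]  # assumed factor order
scores = pd.DataFrame(fa.transform(ratings), columns=aspects)
scores[["preset", "prompt", "source"]] = story_meta  # hypothetical metadata
```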

Overview of all Results for the Presets

I summarized the results of all analyses on the presets in the table below.

| Preset | Performance during continuous story generation (without user intervention) |
| --- | --- |
| Genesis | Less coherent than the average |
| Basic Coherence, Ouroboros, Fandango | Solid – average on all story aspects |
| Ace of Spades | Avoids repetitions better than the average |
| All-Nighter | Wide range of possible outputs in terms of coherence: you'll get more incoherent outputs, but also more strongly coherent ones (= high variance) |
| Low Rider | Wide range of possible outputs in terms of pace: you'll get more stories with slow pace, but also more fast-paced ones (= high variance) |
| Morpho | Major issues with repetitions. Low pace (possibly due to the repetitions), with high consistency: Morpho will consistently generate stories with a low pace (= low variance) |

Click here for more details on differences between presets

Differences in Average Performance

The following graph shows the performance of the presets for continuous story generation on the 5 story aspects.

A value of 0 represents the expected average over all presets. So bars that point downwards indicate performance below the average, bars that point upwards indicate performance above the average.

(These results are corrected for differences between prompts. So if a genre got more “difficult” prompts by bad luck, this is corrected for in this graph).

Graph of performance of the tested presets

Note that small deviations from the average are probably just due to randomness. The following differences were strong enough that we have statistical evidence that they represent something real (testing with alpha = .1):

  • Genesis: Less coherent than the average
    (stats with sum (deviation) contrast coding: b = -.31, adjusted p = .04; see the model sketch below)
  • Ace of Spades: Less repetition than the average
    (b = .37, adjusted p = .01)
  • Morpho: More repetitions than the average
    (b = .72, adjusted p < .001)
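For the stats-curious, a minimal sketch of what such a model can look like (using the hypothetical scores frame from above; Holm is just one common choice of p-value adjustment, and the column names are assumptions):

```python
import statsmodels.formula.api as smf
from statsmodels.stats.multitest import multipletests

# Sum (deviation) contrasts: each preset's coefficient is its deviation from
# the grand mean, while controlling for prompt and rater source.
model = smf.ols("coherence ~ C(preset, Sum) + C(prompt, Sum) + C(source, Sum)",
                data=scores).fit()

preset_terms = model.pvalues.filter(like="preset")
adjusted = multipletests(preset_terms, method="holm")[1]
print(dict(zip(preset_terms.index, adjusted)))
```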

At least for continuous story generation, I would therefore recommend that you do not use Genesis. It seems to underperform on coherence, with no benefit. It is, of course, possible that Genesis does well on some other aspect that was not measured, but at this point there is no evidence for this.

Based on these results, my recommendation would be to use Ace of Spades as the default preset for continuous story generation. It does better at avoiding repetition than the rest. A possible caveat would be that it is unclear if Ace of Spades hurts memory or lorebook usage. However, Ace of Spades' repetition penalty is not especially high (Genesis and Basic Coherence have a higher rep pen). Instead, it seems to benefit from its extremely high repetition penalty slope: early parts of the context should not be affected much by the rep pen, while later parts will be strongly affected. Since memory and lorebook are usually at the beginning of the context, Ace of Spades should actually penalize these less than many other presets.

Morpho's lack of rep pen really hurts its performance without user intervention. The story visibly degrades the longer it continues with this preset (example outputs here). Arguably, Morpho was never intended to be used without user intervention. So, I do not recommend Morpho for continuous story generation. It might have its uses for spicing up single outputs, or if you do not mind heavy editing and steering, but there currently is no data on this.

Basic Coherence, Ouroboros, and Fandango more or less perform the same on all story aspects. They are solid presets for continuous story generation, but at the same time they do not really stand out in any aspect.

Low Rider and All-Nighter are also more or less average on all story aspects. However, they turned out to differ in how consistent they are on some aspects. This might make them useful for special purposes, or at the very least it makes them interesting case studies. Read on below for more about this.

Differences in Consistency of Performance

Okay, if you are not familiar with statistics and the concept of variance, the following analysis might be a bit confusing. However, I think the findings are quite interesting, and I'll do my best to walk you through it.

So let's take a step back and think about what presets might do:

  1. Cut out bad tokens – this should increase the average scores on story aspects. These average scores over several outputs are what we have analyzed so far.
  2. Change the distribution of token probabilities. The clearest case is the randomness setting. This can either make likely tokens even more likely, leading to fairly consistent outputs. Or it can make likely tokens less prominent and strengthen somewhat unlikely tokens. This would make outputs more varied. In other words, presets might differ in their variance of story aspect scores. Two presets might differ in how consistent their outputs are, even if they have the same average.

To visualize this point, imagine the following example:

  • Preset 1 only produces 50% outputs that are completely incoherent garbage (score 1) and 50% outputs with perfect stories that are totally plausible with a clear theme (score 5)
  • Preset 2 only produces outputs that are okay – somewhat incoherent in places, but in general fine (all score 3)

Both presets have the same average, but clearly different variance: preset 1 is maximally inconsistent, preset 2 is maximally consistent. These are extreme examples, but it might be interesting to test whether the presets differ in consistency like this. That is what this analysis does.
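Here is the toy example in numbers, together with one standard way of testing variance differences (a Brown–Forsythe/Levene test – purely illustrative; the study's variance analysis may have used a different procedure):

```python
import numpy as np
from scipy.stats import levene

preset_1 = np.array([1, 5] * 20)       # half garbage, half perfect stories
preset_2 = np.array([2.9, 3.1] * 20)   # consistently "okay" stories

print(preset_1.mean(), preset_2.mean())  # same average: 3.0 and 3.0
# Same mean, very different spread -> a clear variance difference.
print(levene(preset_1, preset_2, center="median"))
```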

The result of this analysis was:

  • Presets differed in their consistency regarding coherence:
    • All-Nighter was less consistent on coherence than the average (adjusted p = .01).
  • Presets differed in their consistency regarding Pace:
    • Morpho was more consistent on Pace than the average (adjusted p = .08). Remember that Morpho also had lower pace than average in general. So, in other words, Morpho will consistently produce stories that have a low pace.
    • Low Rider, on the other hand, was less consistent on Pace than the average (adjusted p = .08).
  • Presets showed more or less the same variance when it came to Creativity, Repetition and Consistent Characterizations.
  • Prompts don't differ in variance for any of the story aspects, in case you were wondering.

Visualizing variance in a way that is easy to grasp is a bit tricky, but we will try box plots. (Strictly speaking, box plots don't visualize variance, but rather the distance between percentiles. However, they are easier to process than other visualizations and should get the point across best.)

Below are two graphs that compare the consistency of the relevant presets to the rest. For our purposes, the important thing to understand is that the “box” in these graphs represents the middle 50% of the data. So, a narrow box means that most of the data is restricted to a narrow range, while a long box means that the values have a wider spread.
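If you want to build this kind of plot yourself, a minimal matplotlib sketch (again using the hypothetical scores frame from above) could look like this:

```python
import matplotlib.pyplot as plt

mask = scores["preset"] == "All-Nighter"
fig, ax = plt.subplots()
ax.boxplot([scores.loc[mask, "coherence"], scores.loc[~mask, "coherence"]],
           labels=["All-Nighter", "Rest"])  # box = middle 50% of the stories
ax.set_ylabel("Coherence (factor score)")
plt.show()
```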

Consistency of Coherence

Boxplot for coherence – All-Nighter vs. Rest

As you can see, All-Nighter has about the same average as the rest, but outputs are much more varied. All-Nighter will give you more outputs that are incoherent nonsense, but you should also have a higher chance for the occasional stroke of genius and get something that is more coherent than usual.

Consistency of Pace

Boxplot for pace – Low Rider and Morpho vs. Rest

Low Rider has about average pace over all its stories, but the stories are much more varied in pace than usual. Low Rider will generate more stories with low pace, but also more fast-paced stories.

Morpho has a lower pace than average (see above), and seems very consistent about generating stories with low pace.

Conclusion from Analysis of Consistency

Remember that none of the presets produce more coherent or more fast-paced stories on average than the rest. However, All-Nighter shows less consistency when it comes to coherence, while Low Rider is less consistent about pace. As long as you do not mind spamming retry, these properties of the presets could be utilized if you want more coherent or fast-paced output:

  • If you want strongly coherent outputs and are willing to retry a bunch, try All-Nighter. You'll have to deal with more nonsense outputs, but you'll also get more good ones in exchange.
  • If you want fast-paced stories and don't mind retrying slow-paced outputs, try Low Rider. You will get more slow-paced output than usual, but in exchange, you will also get more fast-paced ones.

Discussion of Preset Findings and Recommendations

We can combine all the findings above to make some general observations and recommendations for the presets.

First, I should make some cautionary notes, though. This study compared output from continuous story generation with no user input. So, my recommendations are for users that want to use the AI in a similar way - long, uninterrupted chains of story output with little user input. It might apply to users that steer the AI more heavily, but it also might not. We do not have the data yet to see if those cases differ (I suspect they do not, but I frankly do not have the data to back that up).

With that out of the way, this is what I would take away as key recommendations from these findings:

  • Avoid Genesis - it is less coherent than other choices with no known upside
  • Use Ace of Spades as your default preset. It avoids repetition better than the other tested presets and has no known downside.
  • If you are willing to spam retry, presets with low consistency can give you two more potential options:
    • If you want superior coherence, stay up retrying with All-Nighter: You will have to sort out more incoherent outputs than usual, but you will also get more strongly coherent ones than usual
    • If you are looking for fast-paced stories, ride the retry button with Low Rider: You will get more outputs with low pace, but in exchange you will also see more fast-paced ones

What settings seem to matter for the presets?

Most presets combine many sampling methods in different orders. There is very limited understanding of how these settings actually interact, so understanding why we see some differences between presets is a bit difficult. But I want to take a stab at it anyway. This discussion might be most interesting to preset creators or, more generally, people who like to fiddle with the sliders for the advanced generation settings. Click the thing below if that is something that interests you.

Click here for a discussion of generation settings

We will start by stating some of the more obvious observations.

First, a large part of how well a preset does seems to be determined by how well it avoids repetition. Morpho unsurprisingly doesn't do well here, since it applies no rep pen at all. Arguably, keeping repetition at bay isn't even the goal of Morpho. But for continuous story generation, the amount of repetition just tanks Morpho's performance. The other presets apply a rep pen between 2.165 (Ouroboros) and 3.05 (Basic Coherence). Within this range, there do not seem to be large differences when it comes to repetition – these presets perform about the same on avoiding repetition, except for Ace of Spades. This one positive exception is noteworthy, however. Ace of Spades applies a relatively extreme repetition penalty slope of 7.02, while the other presets stay well below 1.0. The fact that Ace of Spades seems more successful than other presets at avoiding repetition, with seemingly no downside, suggests that it might be worth experimenting more with high values of rep pen slope.

Another preset that should be looked at more closely is Genesis. The default preset for Euterpe underperformed on coherence, and it is not immediately obvious why. Genesis could be characterized as very light Nucleus and TFS sampling (0.975 for both) combined with low randomness (0.63). Something in that combo does not seem to work as intended, but it is unclear what exactly. Morpho comes somewhat close as a preset with no sampling and low randomness (0.6889), but surprisingly, it does not show the same issues as Genesis. Morpho, of course, does not have any rep pen, while Genesis is relatively heavy on rep pen (2.975). So, maybe low randomness combined with strong rep pen is somehow problematic? I have no clue why that should be the case, though…

Speaking of randomness, All-Nighter was an interesting case in that it performed at around average coherence, but did so much less consistently than the other presets, with much more pronounced variance on coherence. We do not need to search long for the reason for this inconsistency. It is… randomness!

All-Nighter is essentially strong TFS sampling (0.836) combined with high randomness (1.33). Dialing up randomness does not seem to hurt average coherence much, but it does increase the spread around that average. Two more results from low-randomness presets fit well with this: Morpho (randomness = 0.6889) and Basic Coherence (randomness = 0.585) both had slightly lower variance on coherence. I did not list these results above because the evidence for these effects is fairly weak and might still just be due to chance (both adjusted ps = 0.149). However, it does add weight to the suspicion that randomness increases the variance of coherence.
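To make that intuition concrete, here is a toy numpy example (the logits are made up) of how temperature reshapes a token distribution – low temperature concentrates probability on the top token, high temperature spreads it out, which fits the “same average, higher variance” pattern:

```python
import numpy as np

def softmax(logits, temperature):
    z = np.asarray(logits) / temperature
    e = np.exp(z - z.max())
    return e / e.sum()

logits = [4.0, 3.2, 2.5, 1.0, 0.2]  # imaginary next-token scores
for t in (0.585, 1.33):             # Basic Coherence vs. All-Nighter randomness
    p = softmax(logits, t)
    entropy = -(p * np.log(p)).sum()
    print(f"T={t}: top-token prob = {p[0]:.2f}, entropy = {entropy:.2f} nats")
```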

There is one last finding on the presets that I am not quite sure what to make of: Low Rider and Morpho both stood out from the other presets in their consistency on pace (Low Rider less consistent, Morpho more consistent). Honestly, no idea why. The evidence for these effects is somewhat weak (both adjusted ps = 0.08), so maybe these are just coincidences. Future studies will have to show…

Which Prompts are Easy or Hard for Euterpe?

I generated the stories using 4 prompt/memory pairs (Hard Sci-Fi / High Fantasy / Historical Romance / Horror). The AI was free-roaming, so quite a few stories did not adhere to these labels, but they should give a general direction (“Historical Romance” was basically never “historical”, for example). You can see the details of what each pair meant here.

Similarly to the comparisons between presets above, I can also use the scores for each story aspect to get an idea of how well Euterpe performed on each prompt:

Graph of performance on the different prompts

A value of 0 represents the expected average over all prompts. So bars that point downwards indicate performance below the average, bars that point upwards indicate performance above the average.

(These results are corrected for differences between presets. So if a genre got more “under-performing” presets than others by pure chance, this is corrected for in this graph.)

Just as for the analysis of the presets, small deviations from the average are probably just due to randomness. The following differences were strong enough that we have statistical evidence that they represent something real (testing with alpha = .1):

  • "High Fantasy" and "Hard SciFi" both produced stories that were a more coherent, had less repetitions and a slower pace than the average
  • Additionally, "Hard Sci-Fi" produced story that were more creative and had more consistent characterizations than the average
  • “Horror” wasn't doing so well – Stories had more issues with low Coherence, low Creativity, inconsistent characterizations and more repetitions than the average. But hey, at least the pace was fast…

The fact that the fantasy and science fiction stories are doing so well can probably be chalked up to the fine-tune dataset containing plenty of stories from these genres, so Euterpe is relatively familiar with them.

The fact that horror is doing so badly without an extra module might be worth a closer look, though:

Click here for a closer look at the horror stories

The prompt for Horror was this:

I woke up to hear knocking on glass. At first, I thought it was the window until I heard it come from the mirror again. I got out of bed and walked over to the mirror. When I looked into it, there was a face looking back at me.

Two typical outputs are here

In both cases, it feels like Euterpe understood what things sound scary in theory. But she just throws them in with no sense of connection to the story.

Just take the plot points from excerpt 2:

  • Scary demon appears in mirror in bedroom
  • The police are called (the police are super common with this prompt, somehow)
  • They go up to the bedroom to find the Sheriff lying there dead (?)
  • Protagonist floats up for some reason (?!)
  • Sheriff gets up as a monster and attacks the mom who is now suddenly his wife (?!?)
  • Protagonist floats downwards towards a light and finds himself reborn (?????)
  • Protagonist is now reborn, lives a good life and is fighting the devil… █▄▄ ███ █▄▄ █▄█▄█ █▄█ ▀█▀

Yeah, okay, that feels like Griffin, and you can see how it leads to high pace and low coherence ratings, alright. We have lots of stuff happening, and taken by themselves, these events are not necessarily out of place in a horror story. But they have no real connection to each other. On the one hand, this could just be a fine-tune thing: there might not be enough horror stories in the fine-tune dataset for Euterpe to be any good at this. On the other hand, this type of prompt might just be confusing for language models in general. The horror in this prompt stems from something being “terribly wrong” – mirrors should not work like that. Language models should in most cases be trained to avoid selecting tokens that are “wrong” like that: in most stories, we would want mirrors to show a reflection, not the face of a stranger. So in a way, this prompt might be an uphill battle for most language models.

Are NAI Users “Different”?

I sampled raters from two sources: The community (NAI Discord & Reddit, AIM Discord) and panel sites for academic research (SurveySwap.io and SurveyCircle.com).

Some of you might have not only rated a story but might also have given the link to friends and family (big hug and thanks if you did!). Still, it is probably safe to assume that the majority of people from the community sample are users of NAI or some other language model tool. When I say "NAI users" in the following paragraphs, just think "people that used the link from the Discord or Reddit".

The other source was the panel sites for academic research: SurveyCircle and SurveySwap. They let you either drop a few bucks or fill out a few surveys yourself to have people participate in your own survey (let's just say you can get many surveys done when you have to commute…). It is probably fair to assume that most participants from this source are either students or the occasional junior researcher. So, not really a sample that is “typical” of the broader population, but still an interesting point of comparison to NAI users.

All previous analyses above corrected the scores for the sample source. The results I showed you were the expected scores for NAI users. But we can, of course, also examine whether NAI users differ from a more “normal” population of mostly students. So let's do that! (The following analysis is corrected for the influence of prompts and presets.)

Graph of differences between NAI users and panel participants

As always, small deviations from average are probably just random noise. The following differences were clear enough to be statistically significant, though (testing with alpha = .1):

  • NAI Users rated story creativity lower than panel participants
  • NAI users rated avoiding repetitions and pace higher than panel participants

What does this mean? Umm… I am not sure either :-D. But here are some ideas: Both groups were told that the stories were AI-generated. However, NAI users will be more familiar with what AI-generated stories look like.

The creativity rating is interesting, because I could imagine that panel participants unfamiliar with AI stories might actually have lower expectations of story creativity and quality than NAI users. So, they might be surprised at the interesting ideas that a language model can generate.

Similarly, NAI users are more familiar with the fact that AI-generated texts can run into problems with repetition, so they might be more willing to overlook slight issues with repetition.

No idea about pace; maybe we are just more patient readers…

That's all, folks (for now)!

And that wraps up the analysis for this study!

My original goal was mostly to get a good scale for rating model story output, and I think the AISS v1 is a good start!

I eventually want to do more research on this and see if there are more aspects or if those aspects can be refined more. However, this will be sloooooow progress. Remember I do this in my spare time, plus I am lazy…

But hey, if I get around to doing more work on this, it will be linked here!
