Idea scratchpad: Measuring story quality - TravelingRobot/NAI_Community_Research GitHub Wiki


No idea if this will ever go anywhere, but I want to collect some ideas on what we would want to measure AI text output on:

What aspects would we be looking for in a good text from the AI?

  • Story Soundness / Cohesiveness
  • Creative Story Development
  • Usage of Lore entries
  • Repetitiveness
  • Tonal Pacing
  • Style

Collection of potential items

(some from established scales, some drafted by me, some generated by NAI :-D)

Following guidelines by Hinkin (1998), I aim for about 8-12 items per theorized subscale, if at all possible. That way I can comfortably sort out 50% of the items if need be and still end up with scales of around 4-6 items. I try to avoid more than 12 items per scale so the whole survey does not become too long. Items that were dropped from consideration are struck through.

Soundness / Story Cohesiveness

  • I had a hard time making sense of what was going on in the story. (Narrative Engagement Scale)
  • I had a hard time recognizing the thread of the story. (Narrative Engagement Scale)
  • The story appeared to be a single plot. (Tambwekar et al., 2019)
  • The plot of the story was plausible. (Tambwekar et al., 2019)
  • This story’s events occurred in a plausible order. (Tambwekar et al., 2019)
  • The story felt like a coherent story. (Tambwekar et al., 2019)
  • The story felt like it contained a bunch of jumbled topics. (Tambwekar et al., 2019)
  • The story stayed on topic with a consistent plot. (Tambwekar et al., 2019)
  • The story felt like a series of disconnected sentences. (Tambwekar et al., 2019)
  • The story taken as a whole had a clearly identifiable plot. (Tambwekar et al., 2019)
  • The story had a clear theme.
  • The story lacked logic.
  • The story had no identifiable plot. (Tambwekar et al., 2019)
  • This story’s sentences make sense given sentences before and after them. (Purdy et al., 2018)
  • The events in the story made sense.
  • The descriptions of things and characters in the story are plausible.
  • The setting of the story is described in a consistent way.
  • The story developed in a logical manner.
  • It was easy to understand where things were going, and how they got there.
  • The story includes elements that contradict each other.
  • The way things happen does not seem plausible.

Consistent Characterizations

  • Descriptions of characters in the story were consistent.
  • Characters in the story were described in a contradicting manner.
  • My understanding of the characters in the story is unclear.
  • The way the characters were described was inconsistent.
  • The descriptions of characters in the story were plausible.
  • The behavior of characters in the story seemed completely random.
  • How characters in the story acted seemed implausible.
  • It was easy to understand the motivation of the characters in the story.

Creativity / "Interestingness"

  • The story felt dynamic. (inspired by DeLucia, Mueller & Li, 2021)
  • The story was boring. (inspired by DeLucia, Mueller & Li, 2021)
  • The plot development in the story was predictable.
  • The story was creative.
  • The plot of the story was imaginative.
  • It was surprising how things turned out in the story.
  • There were interesting twists and turns in the story.
  • I was intrigued by the plot.
  • The setting of the story was original.
  • The story was unconventional.
  • The plot was typical for this kind of story.
  • The story was innovative.
  • The plot of the story was novel.
  • The plot of the story was original.
  • What happens in the story doesn't match what I expected to happen.
  • I was surprised by the plot of the story.

General Story quality

(likely to have too many cross-loadings, but these items are used often enough that it is worth giving them a try)

Repetitiveness / Plot getting stalled

  • This story avoids repetition. (Purdy et al., 2018)
  • Many sentences in the story had frequently repeated words and phrases. (inspired by DeLucia, Mueller & Li, 2021)
  • The story was very repetitive.
  • In the story, the same things happened again and again.
  • The writing seemed to use the same words over and over.
  • Characters repeated their actions with little variation.
  • The plot had no development.
  • One character did something he or she had already done previously in this story.
  • Characters said or did the same thing many times over.
  • Characters repeated what other characters had said to them.
  • Particular words were used too often in the story.
  • There were similar events that occurred repeatedly in the story.

Style

  • This story uses interesting language. (Purdy et al., 2018)
  • The story had sentences that were unreadable. (inspired by DeLucia, Mueller & Li, 2021)
  • The text contains a broad vocabulary.
  • The story used complex vocabulary.
  • The wording of this text is very precise.
  • The text is easy to understand.
  • The writing style is too complicated to be understood easily.
  • The story contained a great deal of detail.
  • The writing style of the story was very good.
  • The writing style was entertaining.
  • The author's choice of words was elegant.
  • The story had no obvious grammatical mistakes.
  • The language used to write the story was appropriate and effective.

Tonal Pacing

  • The story moved at a fast pace.
  • The story was exciting to read.
  • It took a long time for things to happen in the story.
  • The story dragged on and on.
  • Nothing seemed to be happening in the story.
  • There was plenty of action in the story.
  • Many things seemed to be happening at once in the story.
  • All elements of the story were relevant to the plot.
  • There's nothing superfluous or unnecessary in this story.

Potentially for later studies

  • Might want to explore soundness/plausibility as opposed to cohesiveness
  • Might want to explore not keeping characters apart
  • "easy to control"/"gives output that is fitting with instructions" (is that its own thing separate from coherence? idk...)
  • contradictions ("How often do you think the AI is contradicting itself? Where does this happen?") (Valahraban)

Immersiveness

  • While I was reading the narrative, I could easily picture the events in it taking place.
  • While reading the story I found myself thinking about other things.
  • While I was reading the narrative, activity going on in the room around me was on my mind.
  • I was mentally involved in the narrative while reading it.
  • After the narrative ended, I found it easy to put it out of my mind.
  • I wanted to learn how the narrative ended.
  • I found myself thinking of ways the narrative could have turned out differently.
  • I found my mind wandering while reading the narrative.
  • I had a hard time keeping my mind on the story.
  • During the program, my mind was inside the world created by the story.
  • The story created a new world, and then that world suddenly disappeared when I finished reading the story.
  • At times during the story, the story world was closer to me than the real world.

Usage of Lore entries (Leave this for a separate study)

  • The description of the character in the first text matches with the character in the story.
  • The description of the character in the first text contradicts how the same character is later characterized in the story.
  • Traits of the character in the first text can also be found when the character is described in the story.
  • The character described in the first text seems to have nothing in common with the character of the same name in the story.
  • The character described in the story seems to be based on the character in the first text.
  • The characters of the two texts are similar.
  • The personality and traits of the characters match in both texts.
  • The character described in the second text seems to be written without reading the first text. (Tambwekar et al., 2019)

Emotionally Engaging (too high level and too specific - will not use)

  • During the story, when a main character succeeded, I felt happy.
  • During the story, when a main character suffered in some way, I felt sad.
  • I felt sorry for some of the characters in the story.
  • The story affected me emotionally.
  • The narrative affected me emotionally.

Various established scales

Transport Narrative Questionnaire

  • While I was reading the narrative, I could easily picture the events in it taking place.
  • While I was reading the narrative, activity going on around me was on my mind.
  • While I was reading the narrative, activity going on in the room around me was on my mind.
  • I was mentally involved in the narrative while reading it.
  • After the narrative ended, I found it easy to put it out of my mind.
  • I wanted to learn how the narrative ended.
  • The narrative affected me emotionally.
  • I found myself thinking of ways the narrative could have turned out differently.
  • I found my mind wandering while reading the narrative.
  • I had a vivid mental image of at least one character in the story.

Narrative Engagement Scale (Adapted for texts)

Narrative Understanding

  • At points, I had a hard time making sense of what was going on in the story.
  • My understanding of the characters is unclear.
  • I had a hard time recognizing the thread of the story.

Attentional Focus

  • I found my mind wandering while reading the story.
  • While reading the story I found myself thinking about other things.
  • I had a hard time keeping my mind on the story.

Narrative Presence

  • During the program, my mind was inside the world created by the story.
  • The story created a new world, and then that world suddenly disappeared when I finished reading the story.
  • At times during the story, the story world was closer to me than the real world.

Emotional Engagement

  • The story affected me emotionally.
  • During the story, when a main character succeeded, I felt happy.
  • During the story, when a main character suffered in some way, I felt sad.
  • I felt sorry for some of the characters in the story.

Draft for promoting study

I am currently running a study for which I need the help of the community. The gist of it: I want to develop a high-quality rating scale so we have a better tool to rate story quality. To do that, I need about 300 people to fill out a 15-minute survey. All you need to do is read a random AI-generated story and rate it on a bunch of questions. I would really appreciate your help! You can be awesome and participate by using this link.

Why do we need this anyway?

In the end, what we all truly care about when using NAI is getting coherent, well-written, creative stories. This is the reason we tinker with generator settings, Author's Note, Biases, etc. But without a good measure of story quality, it is hard to judge whether these things really affect the output the way you want, or if you are just kidding yourself. How does increasing Top K sampling by .05 affect coherence or creativity? Does it depend on the randomness setting? Or do all these sliders in reality do nothing at all, and the devs are just playing an elaborate prank on us? If we had a good, well-constructed measure for this, we could run actual studies and, piece by piece, put together how all these settings affect the story on a grand scale.

"But AI researchers already use human ratings for these things"

The tl;dr: AI researchers are extremely intelligent people, and their papers make my head hurt because I am dumber than they are. However, they are not psychometricians, and so their ways of rating story quality are not ideal. Getting usable data from human raters requires understanding how they approach stories, and building your scale on real data.

Slightly longer version: Here is the thing - measuring a complex concept like "story fluency" or "coherence" with a single question is terrible. If you are really interested in why, see: https://link.springer.com/article/10.1007/s11747-011-0300-3

But in a nutshell: The measures commonly used by AI researchers for human ratings of stories are not great. Even if you happen to ask well-written questions (some are not, though), the results will still be unreliable and have low predictive power. That is just the fate of using scales with only one item. Single-item measures are also terrible for many types of statistical analyses (even just reporting things like the mean is very questionable). The fact that none of the commonly used items went through any preliminary testing to see if they work well probably makes the whole issue even worse. Furthermore, just because your theory says that the deciding aspects of AI-generated stories are fluency, relevancy, coherence and whatever, does not mean that the average human actually thinks in or perceives these categories when rating stories. The constructs humans use to assess stories might have a very different structure than your theory. Which is why you need to check your scale before you wreck your scale. Otherwise you think you are measuring one thing, but you are really just measuring a bunch of nonsense. I regularly consult people on scale construction in my job, and I have never seen a purely theoretical scale survive real-world data. Humans are complicated... go figure.
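To make the reliability point concrete, here is a minimal sketch (using simulated, made-up rating data, not anything from the actual survey) of Cronbach's alpha, the standard internal-consistency estimate for a multi-item scale. A single-item measure cannot even be checked this way, which is part of why its reliability is unknowable:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_raters, n_items) matrix of ratings."""
    n_items = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of the sum score
    return n_items / (n_items - 1) * (1 - item_vars.sum() / total_var)

# Simulated data: 300 raters, 4 items that all imperfectly tap the
# same latent "story quality" judgment (latent score plus item noise).
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 1))
ratings = latent + rng.normal(scale=0.8, size=(300, 4))

print(round(cronbach_alpha(ratings), 2))
```

With this setup, alpha lands in the mid-.80s; with real raters, item-level noise is what the 8-12 drafted items per subscale are meant to average out.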

So what will you do with this data?

Once we hit 300 responses, I will close the survey, post the data on GitHub, and post the link in #novelai-research (the data will be under a CC BY-SA 4.0 license). So everyone who wants to have a go at poking at the data can do so. Once I get around to it, I will try to analyze which items belong together and whether I can make sense of that structure (doing a principal component analysis). Expect that to take a little bit of time. Based on that, I will propose a scale, built on psychometric best practices, for rating properties of stories. The results and the scale will also be posted on GitHub (again under a CC BY-SA 4.0 license). I might at some later point try to confirm that scale with extra data for some fine-tuning (running a confirmatory factor analysis), but the scale should already be fine to use before then. So anyone who wants to go ahead and run an actual study on the effect of Top-A sampling/randomness settings/token biases/filling-the-lorebook-with-goose-memes on story quality will have a high-quality instrument to do so.
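The item-sorting step described above can be sketched with plain NumPy. This is a toy illustration on fabricated data (the factor structure and item labels are assumptions for the example, not results): eigen-decompose the items' correlation matrix and count components with eigenvalue above 1 (the Kaiser criterion), which is one common first look at how many subscales the data actually supports:

```python
import numpy as np

def pca_loadings(ratings: np.ndarray):
    """Eigen-decompose the correlation matrix of item ratings.

    Returns eigenvalues (sorted descending) and the matrix of
    component loadings (items x components).
    """
    corr = np.corrcoef(ratings, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    loadings = eigvecs * np.sqrt(eigvals)   # scale eigenvectors to loadings
    return eigvals, loadings

# Fake survey data: 300 respondents, 6 items driven by two latent
# factors (say, "cohesiveness" and "creativity"), 3 items each.
rng = np.random.default_rng(1)
f = rng.normal(size=(300, 2))
ratings = np.hstack([
    f[:, [0]] + rng.normal(scale=0.7, size=(300, 3)),  # cohesiveness items
    f[:, [1]] + rng.normal(scale=0.7, size=(300, 3)),  # creativity items
])

eigvals, loadings = pca_loadings(ratings)
n_components = int((eigvals > 1.0).sum())   # Kaiser criterion
print(n_components)  # → 2
```

On real responses one would also inspect the loadings (and cross-loadings) to decide which items to keep, which is exactly the sorting the 8-12 items per subscale leave room for.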