Research Methodology
See the NAI Knowledge Base for a short explanation of the Token Probability Viewer.
The Probability Viewer allows you to see what the AI considers to be the most likely tokens.
- Useful to test the existing "knowledge" of the AI on concepts, historical facts, etc.
- Also useful as a baseline test to establish whether context manipulations like lorebook entries, memory, A/N, stage instructions etc. work as intended on the most likely tokens. Further, testing with different settings can then establish the reliability of the context manipulation.
For testing methods that require sampling, you can drastically reduce the time required for producing the outputs by using OccultSage's nrt.
- Gnurro noted that a good test for a format's accuracy might be trying to override known AI biases. For example, the AI shows clear hair colour biases for certain character names (most frequent colours out of 30 sampled outputs):
- Maki: Black 14/30, Brown 7/30
- Sakura: Black 14/30, Brown 6/30
- Satan: Black 15/30, Brown 6/30
- Fluttershy: Brown 10/30, Purple 8/30
Test input was (default settings):
[The following is a Q&A with Akari.]
Q: What is your name?
A: My name is Akari.
Q: What is your hair colour?
A:
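If you want to run such a bias tally yourself, here is a minimal sketch of how it could be automated. Note the assumptions: `generate()` is a hypothetical stand-in for however you obtain a single completion (reading nrt output files, calling the API, etc.), and the `COLOURS` list is just an illustrative set of answers to look for.

```python
# Minimal sketch of tallying hair-colour completions over many samples.
# generate() and COLOURS are placeholders, not part of nrt or the NovelAI API.
from collections import Counter

COLOURS = ["black", "brown", "blonde", "purple", "red", "blue"]

def make_prompt(name: str) -> str:
    # Same Q&A format as the test input above, with the character name swapped in.
    return (
        f"[The following is a Q&A with {name}.]\n"
        f"Q: What is your name?\n"
        f"A: My name is {name}.\n"
        "Q: What is your hair colour?\n"
        "A:"
    )

def tally_hair_colours(generate, name: str, n_samples: int = 30) -> Counter:
    counts = Counter()
    prompt = make_prompt(name)
    for _ in range(n_samples):
        answer = generate(prompt).lower()
        for colour in COLOURS:
            if colour in answer:
                counts[colour] += 1
                break  # only count the first matching colour per sample
    return counts
```

A format that can push the counts away from the known bias (e.g. reliably making Maki's hair "green") is then evidence that the format actually reaches the AI.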
Note: With the probability tool, such testing could be conducted much more efficiently.
(pioneered by placebomancer, refined by various others) This method is described on the knowledge base wiki. Note that it is an exploratory method: attributes discovered this way do not necessarily translate into good attributes that actually influence the output in the desired way, and they still have to be confirmed by further testing. But this method can be a good first step to discover promising attributes.
(placebomancer)
This is essentially the reverse of the method in the link above.
For example, entering the following:
[ Robert E. Howard: Author]
[ sword & sorcery, pulp fantasy: Genre]
[ dark, gritty, violent:
generates Tone as a trait. Hence, tone: seems to have useful associations and might be useful to use in A/N. The same cautionary note as for the method above holds true here: traits discovered this way still have to be confirmed by further testing!
(placebomancer)
Enter something like this into input:
[ Synonyms: Robert Jordan; Brandon Sanderson ]
[ Synonyms: H.P. Lovecraft;
This way, the AI should give you authors similar to Lovecraft. Using the Probability Viewer should also speed up the process.
The general ideas in the following section still hold, but it does not take into account the improved tooling (new nrt features and the Probability Viewer) we have available now. Just keep in mind that the procedures described below are probably unnecessarily complicated and should be replaced by something that makes use of the more efficient tools.
In general, getting results that are close to actual use cases requires:
- sampling of outputs (since realistic outputs vary randomly)
- testing under a wide variety of contexts (to understand how specific your findings are to certain conditions)
- a valid and reliable measure of story quality (or whatever target variable you want to affect)

The first two points have gotten much easier with nrt, but they still require some work, time and patience. For the third, we still do not have ideal measures for rating every aspect of an output.
That being said, the following are currently the best practice methods:
In general, evaluations of samples should strive to avoid confirmation bias. If you are, for example, comparing different keyword variations for Author's Note (say Genre: Fantasy versus Genre: fantasy), you'd want to rate objectively whether that keyword was successful or not. Ideally, you would always have someone you can quickly ask "Hey, what genre/tone/theme/etc. do you think this is?", someone who is blind to your testing conditions.
...Well, why not just do exactly that with Sigurd?
This is the idea of recursive testing: you take the output from your nrt test set, modify it, and then feed the whole thing back into the AI as input. Usually, you will edit out your context manipulation (like A/N or memory) to make the test "blind" and add an evaluation question for the AI to evaluate the success of your manipulation.
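As a rough illustration of that editing step, here is a minimal sketch of building such a "blind" recursive input. The file name, the A/N string and the evaluation question are hypothetical examples, not part of nrt itself.

```python
# Minimal sketch of constructing a "blind" recursive test input.
# All file names and strings below are illustrative placeholders.

EVAL_QUESTION = (
    "\n[The following is a Q&A about the story above.]\n"
    "Q: What genre is this story?\n"
    "A:"
)

def build_recursive_input(story_text: str, context_manipulation: str) -> str:
    """Edit out the context manipulation and append an evaluation question."""
    blind_story = story_text.replace(context_manipulation, "")
    return blind_story + EVAL_QUESTION

if __name__ == "__main__":
    with open("output_genre_fantasy.txt", encoding="utf-8") as f:
        story = f.read()
    prompt = build_recursive_input(story, "[ Genre: Fantasy ]")
    print(prompt)  # feed this back into the AI, e.g. as the prompt of a new nrt run
```

The AI's answer to the appended question then serves as the "blind rater" for whether your manipulation worked.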
Thanks to nrt and the nrt JSON Builder & Launcher, automated mass recursive testing is now possible. See the repo of the nrt JSON Builder & Launcher for an example of a recursive test run.
Not everything can be evaluated by asking Sigurd questions about the story (this is true, for example, for things like general story quality, cohesiveness, creativity and repetitiveness). In these cases you still need to rate the outputs yourself. There are no validated measures for these things yet, but here is what I would recommend:
- Use nrt to generate your outputs under the testing conditions you want to compare (remember to have a control condition!).
- When you are finished collecting your outputs (more is better, aim for at least 20 per testing condition!), open Excel and look at the txt files of your outputs. Try not to look at the permutation conditions and just look at the iterations/results.
- Rate each iteration on story quality (or whatever target variable you want to affect) from 1 to 5 in one Excel column.
- When you have finished rating one output txt file, look at the permutation condition of that file and note it down in another Excel column (or several, if you permuted over several variables).
- Rinse and repeat for all output files.
- Have Excel compute the average rating per condition.
- Do a simple t-test for independent samples (possible in Excel or with an online tool; see the sketch after this list for a scripted alternative) to see whether the differences in quality are due to chance or statistically significant. If you are not sure how to do this, just hit me up on Discord with your data and I can do that for you (my Discord ID is in the footer; I need the columns with the conditions and ratings).
- If you are able to, repeat for different types of context, then hit me up with the results so I can run a regression to see whether different contexts change things.
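If you prefer to script the analysis step instead of doing it in Excel, here is a minimal sketch. It assumes you exported your ratings to a (hypothetical) ratings.csv with one row per output and the columns "condition" and "rating"; the condition names are also just examples.

```python
# Minimal sketch of the analysis step: average ratings per condition and an
# independent-samples t-test. File name, column names and condition labels
# are illustrative assumptions.
import pandas as pd
from scipy import stats

df = pd.read_csv("ratings.csv")

# Average rating per testing condition
print(df.groupby("condition")["rating"].mean())

# Independent-samples t-test between two conditions, e.g. control vs. test
control = df.loc[df["condition"] == "control", "rating"]
test = df.loc[df["condition"] == "test", "rating"]
t_stat, p_value = stats.ttest_ind(control, test, equal_var=False)  # Welch's t-test
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```

A small p-value (conventionally below 0.05) suggests the rating difference between conditions is unlikely to be due to chance alone.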