Discovering and Encouraging Creativity Features in LLMs
#Creativity
Authors: Illia Voloshyn, Miles Brown
Mentor: Karen Zhou
GitHub Repository for the code
Abstract
The debate over whether large language models (LLMs) can truly exhibit creativity or merely reproduce information from their training data has significant implications for the creative industry. We evaluate LLM creativity across two key dimensions: novelty — the ability to generate unique concept combinations beyond the model’s training data — and effectiveness — the capacity to express ideas clearly and meaningfully. Using DJ-Search and LLM-based evaluation algorithms, we quantify these aspects to establish benchmarks. Then, we apply prompting techniques to enhance creativity across both dimensions, starting with a simple creativity-oriented prompt and iteratively refining it to emphasize qualities such as emotional depth, unconventional imagery, and dynamic narrative flow. With the best prompt, the DJ-Search score increased by 92% on average, while the LLM-based evaluation showed no significant change. This demonstrates our ability to enhance novelty without compromising effectiveness.
What this project is about
LLMs have seamlessly integrated into everyday life, transforming how we tackle everything from crafting routine emails to pushing the boundaries of creativity in fields such as story composition, where AI is now a powerful collaborator. However, studies indicate that current models underperform in creativity-related tasks compared to humans. This raises a key question: Are current LLMs' creativity limitations primarily attributable to suboptimal prompting (and thus remediable through improved techniques), or do they stem from inherent deficiencies requiring more advanced models? Our project explores this question by operationalizing creativity along two key dimensions: novelty — the ability to generate unique concept combinations beyond the model’s training data — and effectiveness — the capacity to express ideas clearly and meaningfully. Creativity is not merely the novelty of a text or the ideas within it; it also requires relevance, depth, and the ability to evoke thought.
This definition naturally allows us to capture creativity using two existing methods: DJ-Search and LLM-based evaluation. The first, DJ-Search, assesses novelty by comparing an LLM’s outputs to a broad reference corpus, measuring how much of its generated text consists of new ideas rather than memorized content. However, this approach alone cannot prevent the model from generating original but nonsensical content. This is where the second method comes into play. The LLM-based evaluation assesses short stories using a set of questions inspired by the Torrance Test of Creative Thinking (TTCT), which measures four key dimensions of creativity: Fluency, Originality, Elaboration, and Flexibility. By combining these two methods, we can evaluate both the novelty and the meaningfulness of the LLM’s outputs. The results serve as a benchmark for an iterative prompting process, where we refine input strategies to enhance both novelty and effectiveness. This exploratory approach will help determine whether current LLM creativity limitations are primarily due to prompting inefficiencies or fundamental architectural constraints.
Approach
As introduced in the project overview, we use DJ-Search to assess novelty. This algorithm was introduced in the paper AI as Humanity’s Salieri: Quantifying Linguistic Creativity of Language Models via Systematic Attribution of Machine Text Against Web Text. One important property of this algorithm is that it is model-agnostic in the sense that it does not use an LLM as a decision-making component, therefore providing a robust and deterministic assessment of novelty. In particular, it calculates $L$-uniqueness, the fraction of words in the text that do not appear in any n-gram of length $n \ge L$ that can also be found in the reference corpus. By summing $L$-uniqueness across different n-gram sizes, the algorithm captures both the extent of originality and the rate at which uniqueness emerges in longer contextual spans. The diagram below illustrates how the algorithm works:
Here, for a specific $5$-gram in the model's output, we show the most similar $n$-grams from the reference corpus. Their scores and lengths are stored and aggregated to produce the coverage (how well the reference corpus explains the model's output). Creativity is then the complement of coverage: the portion of the output that falls outside the scope of the corpus or, equivalently, one minus the coverage.
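To make this concrete, below is a minimal Python sketch of the exact-match flavor of $L$-uniqueness as we understand it. The full DJ-Search also credits near-verbatim matches via embedding similarity; the function names and the toy reference set here are ours, purely for illustration.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def l_uniqueness(tokens, reference_ngrams, L):
    """Fraction of words NOT covered by any n-gram of length >= L found in the corpus."""
    covered = [False] * len(tokens)
    for n in range(L, len(tokens) + 1):
        for i, gram in enumerate(ngrams(tokens, n)):
            if gram in reference_ngrams:          # exact match against the corpus index
                for j in range(i, i + n):
                    covered[j] = True
    return 1.0 - sum(covered) / len(tokens)

def creativity_index(tokens, reference_ngrams, L_min=2, L_max=5):
    """Average L-uniqueness across a range of minimum n-gram lengths."""
    scores = [l_uniqueness(tokens, reference_ngrams, L) for L in range(L_min, L_max + 1)]
    return sum(scores) / len(scores)

# Toy usage: assume the bigram "over there" appears somewhere in the reference corpus.
reference = {("over", "there")}
print(creativity_index("the cat sat over there".split(), reference))
```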
The original DJ-Search algorithm, as released by its authors, assumes access to large-scale computational resources and provides only a barebones implementation. We therefore had to optimize the code to run on our limited computational resources and build the surrounding infrastructure ourselves.
Reference Corpus Selection and Indexing
An important component of DJ-Search is the quality and size of the reference corpus. A sufficiently large corpus is necessary to ensure meaningful uniqueness evaluations, but its size also increases the time required to query relevant documents. To manage this trade-off, we deployed a custom index in Elastic Cloud, leveraging its free tier, which provides 180 GB of storage across two servers. We used Kibana for monitoring and wrote custom Python scripts for indexing and retrieval, ensuring an efficient workflow for querying documents as needed.
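As an illustration of this workflow, the sketch below shows the general indexing and retrieval pattern with the official Elasticsearch Python client. The index name, field name, credentials, and query shape are placeholders rather than our exact configuration.

```python
from elasticsearch import Elasticsearch, helpers

# Credentials below are placeholders for an Elastic Cloud deployment.
es = Elasticsearch(cloud_id="YOUR_CLOUD_ID", api_key="YOUR_API_KEY")

def index_documents(docs, index="reference-corpus"):
    """Bulk-index corpus documents, each stored as a single 'text' field."""
    actions = ({"_index": index, "_source": {"text": doc}} for doc in docs)
    helpers.bulk(es, actions)

def retrieve_candidates(ngram_tokens, index="reference-corpus", k=10):
    """Fetch the top-k documents most likely to contain matches for an n-gram."""
    resp = es.search(
        index=index,
        query={"match": {"text": {"query": " ".join(ngram_tokens), "operator": "and"}}},
        size=k,
    )
    return [hit["_source"]["text"] for hit in resp["hits"]["hits"]]
```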
For the reference corpus, we selected a 180 GB subset of RedPajama, a high-quality dataset that aligns with our primary model of interest, Llama 3.2 3B. RedPajama was chosen for two key reasons:
- Preprocessing and Structure – Unlike raw web data, it is already cleaned and structured, significantly reducing preprocessing overhead.
- Temporal Cutoff Alignment – We selected a subset with a 2023 cutoff date, allowing us to test the algorithm on 2024 data without having to manually withhold and deduplicate results.
Computational Optimizations
The DJ-Search algorithm heavily relies on precomputing cosine similarity scores between word embeddings, a process that originally required storing a massive 30+ GB similarity table in memory. To optimize this step, we made several key improvements:
- Batching Independent Computations – We modified the logic to group independent computations, reducing redundant operations and improving overall efficiency.
- GPU Acceleration – By normalizing embeddings once and computing similarities as inner products (equivalent to cosine similarity, but expressible as a single matrix multiplication), we made the computation efficient on a laptop GPU.
- Memory Mapping for Large Similarity Tables – Instead of loading the entire similarity table into RAM, we implemented a memory-mapped (mmap) NumPy structure, enabling us to store the table on disk and load only the relevant subsets needed for intermediate computations. This significantly reduced memory overhead: where the original implementation crashed Google Colab, our version uses less than 30% of its RAM.
These optimizations allowed us to run DJ-Search effectively on our available hardware, making it feasible to test and refine prompting techniques without requiring large-scale cloud computing resources.
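As a rough sketch (not our exact code), the snippet below illustrates the last two optimizations: cosine similarity computed as an inner product over unit-normalized embeddings on the GPU, and a disk-backed similarity table via np.memmap so that only the rows needed for a given computation are loaded into RAM. Shapes, dtypes, and file names are illustrative assumptions.

```python
import numpy as np
import torch

def build_similarity_table(embeddings: np.ndarray, path="sim_table.dat", batch=2048):
    """Precompute pairwise cosine similarities in batches, writing straight to disk."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    emb = torch.nn.functional.normalize(torch.tensor(embeddings, device=device), dim=1)
    n = emb.shape[0]
    table = np.memmap(path, dtype=np.float16, mode="w+", shape=(n, n))
    for start in range(0, n, batch):
        # Normalized inner product == cosine similarity, but runs as one matmul on GPU.
        block = emb[start:start + batch] @ emb.T
        table[start:start + batch] = block.cpu().numpy().astype(np.float16)
    table.flush()
    return path, (n, n)

def load_rows(path, shape, rows):
    """Memory-map the table read-only and materialize only the requested rows."""
    table = np.memmap(path, dtype=np.float16, mode="r", shape=shape)
    return np.asarray(table[rows])
```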
Our second approach, the LLM-based evaluation, addresses the effectiveness and sensibility of the output. This metric was introduced in Art or Artifice? Large Language Models and the False Promise of Creativity. It is based on the Torrance Test of Creative Thinking (TTCT), a widely used creativity measure for human writing. The TTCT questions focus on four dimensions of creativity: Fluency, Flexibility, Elaboration, and Originality. For each category, we developed seven questions designed to comprehensively cover every aspect of that category. Some questions were drawn directly from the paper, and we introduced additional ones to ensure full coverage of all dimensions.
We then prompt the LLM—in this case, DeepSeek V3—to answer each of these questions about a provided text. All of the questions are phrased so that a positive response, a ‘yes’, indicates a more creative story. For the first 20 stories we provide to DeepSeek, we ask it to give its reasoning as well as a yes/no answer. The model then produces something along these lines:
From this reasoning, we can verify that the model works logically toward its final yes or no answer. However, due to the computational cost of including the reasoning, we decided to drop it for the remaining stories. This drastically decreased the cost, since the model then only produced yes or no answers, but it also means the model no longer needs to reason fully before answering. Since we had designed the questions so that a “yes” indicates higher creativity, the total score for a text was simply the sum of the “yes” responses across all 28 questions.
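A minimal sketch of this scoring loop is shown below. We assume an OpenAI-compatible endpoint for DeepSeek V3; the example question, prompt wording, and answer parsing are illustrative placeholders rather than our exact 28 questions or evaluation prompt.

```python
from openai import OpenAI

# DeepSeek exposes an OpenAI-compatible API; the key here is a placeholder.
client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_API_KEY")

QUESTIONS = [
    "Does the story introduce an original premise rather than a familiar trope?",  # hypothetical example
    # ... 27 more yes/no questions, 7 per TTCT category
]

def creativity_score(story: str, with_reasoning: bool = False) -> int:
    """Sum of 'yes' answers across all questions; higher means more creative."""
    score = 0
    for question in QUESTIONS:
        instruction = (
            "Briefly explain your reasoning, then answer Yes or No on the final line."
            if with_reasoning else "Answer with a single word: Yes or No."
        )
        resp = client.chat.completions.create(
            model="deepseek-chat",
            messages=[{
                "role": "user",
                "content": f"Story:\n{story}\n\nQuestion: {question}\n{instruction}",
            }],
            temperature=0,
        )
        final_line = resp.choices[0].message.content.strip().splitlines()[-1].lower()
        score += final_line.startswith("yes")
    return score
```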
To check that our results were reasonable, we relied on existing literature. Established research has consistently shown that AI-generated text is generally less creative than human text, so we paired human and AI responses to the same prompts to verify our results. Using these pairs and the scores for each text in a pair, we could check that the human texts typically received higher scores than the AI texts. Although some human texts may simply not be very creative, this works as a sanity check as long as a majority of human texts outperform the AI texts.
While both methods draw from established practices in the literature, albeit with variations in questions and datasets, the novelty of our approach lies in the design of prompts and the synergistic combination of metrics. We craft prompts specifically tailored to address weaknesses identified by both evaluation methods, ensuring a more robust and comprehensive assessment. We begin with a heuristically created prompt and analyze low-scoring, medium-scoring, and high-scoring texts to identify patterns and specific areas that we want to penalize or encourage. For instance, if we find that low-scoring texts often contain awkward or disjointed phrasing, we refine the prompt to prioritize fluency and coherence in the outputs. This approach allows us to iteratively refine our prompt, building a more holistic and nuanced understanding of creativity that bridges the gaps inherent in relying on a single metric.
Experiments
Evaluating the Correctness of DJ-Search
Before applying DJ-Search to evaluate LLM-generated text, we first tested its correctness by comparing its originality scores on known creative and unoriginal texts. To do this, we curated two datasets:
- Creative Writing Dataset – Five essay snippets from 2024, selected based on their public availability and perceived originality.
- Unoriginal Writing Dataset – Five essays deliberately designed to lack originality:
- Verbatim Copy: A direct copy from the reference corpus.
- Internal Copy: A rearrangement of sentences from the same source.
- Multi-Source Copy: A compilation of verbatim excerpts from different sources.
- Shuffled Copy: A multi-source copy with altered word order.
- Paraphrased Copy: A rewritten version of multiple sources with synonymous substitutions.
DJ-Search correctly assigned $0$% originality to all unoriginal essays except for the paraphrased text, which received a $2$% originality score. In contrast, the creative essays had an average originality score of $34$%, demonstrating a clear separation. This validation increased our confidence in the algorithm's ability to distinguish between creative and derivative content. Notably, the algorithm penalizes longer texts, as originality scores tend to decrease with increasing length. To control for this effect, we standardized the essay snippet lengths to $100$ words, which improved the average creativity score to $43$% for human-written texts.
Establishing Baseline Creativity Scores
After verifying DJ-Search’s reliability, we constructed a baseline dataset to assess the algorithm’s response to LLM-generated text. We did this by:
- Performing keyword searches in the reference corpus to extract snippets forming coherent stories.
- Manually verifying the coherence of these snippets.
- Using the first few sentences of each story as a prompt for an LLM, aiming to trick the model into recalling its training data.
To ensure fair comparisons, we controlled prompt lengths, allowing at most a $16$-word difference between the shortest and longest prompts. Testing showed that minor variations in prompt length did not meaningfully impact originality scores. We selected $30$ prompts and generated responses using Llama 3.2 3B, trimming each output to 100 words for consistency. The average DJ-Search originality score for these model-generated texts was $41$%, comparable to human creative writing.
LLM-based Evaluation Data
The dataset for our LLM-based evaluation is drawn from gpt-writing-prompts, a collection of prompts accompanied by responses from both GPT-3.5 and human authors. To maintain consistency across our metrics and to allow prompting to play a more influential role, we preserved the original prompts and human-crafted responses but replaced the GPT-3.5 responses with stories generated by the Llama 3.2 3B model at a temperature of 0.7 for the same prompts. This created a balanced dataset of 150 stories (75 crafted by humans and 75 by Llama 3.2 3B), with each pair derived from the same 75 prompts. This symmetry ensures a fair comparison, allowing us to explore the nuanced interplay between human creativity and AI-generated text. During the initial phase, Llama was simply instructed to generate a story of approximately 500 words based on the given prompt, ensuring consistency in length between the human-authored and AI-generated stories. As we transitioned to more refined prompting strategies, the instructions evolved to explore different creative approaches, but the 500-word length constraint was maintained.
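The sketch below shows how such a paired dataset could be generated with the Hugging Face transformers chat pipeline. The model identifier, instruction wording, and sampling settings other than the 0.7 temperature are illustrative assumptions rather than taken verbatim from our pipeline; the optional system message anticipates the prompting experiments described later.

```python
from transformers import pipeline

# Assumed model ID for Llama 3.2 3B; requires access to the gated checkpoint.
generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.2-3B-Instruct",
    device_map="auto",
)

def generate_story(prompt: str, system_message: str | None = None) -> str:
    """Generate an ~500-word story for one writing prompt."""
    messages = []
    if system_message:  # used in the later prompting iterations
        messages.append({"role": "system", "content": system_message})
    messages.append({
        "role": "user",
        "content": f"Write a story of approximately 500 words based on the following prompt:\n{prompt}",
    })
    out = generator(messages, max_new_tokens=800, do_sample=True, temperature=0.7)
    # With chat-style input, generated_text holds the conversation with the reply appended.
    return out[0]["generated_text"][-1]["content"]
```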
LLM-based Evaluation Results
In the initial evaluation, where the task was limited to generating stories solely based on the prompt with a 500-word constraint, we observed that human-authored stories achieved an average score of 20.76 out of the 28 questions. Surprisingly, despite our initial expectation that the AI would underperform significantly, Llama demonstrated competitive capability with an average of 18.8 out of 28. This prompted us to investigate whether the reasoning used for the first 20 stories played a disproportionate role in shaping these scores. Our analysis revealed that performance with reasoning was lower than the average without it: human-authored stories evaluated with reasoning averaged 17.8 compared to a Llama-authored average of 16, while without reasoning, the human-authored average was 3.4 points higher at 21.2 and the AI-generated average was 3.2 points higher at 19.2.
We then wanted to determine whether the AI performed comparatively better than humans in certain categories, or whether the differences were spread fairly evenly across all four categories. To do this, we plotted the following histogram:
In this graph, we observe that AI and humans exhibit notable similarities across several of the metrics, particularly in areas like fluency and flexibility. For instance, the AI even slightly outperforms human authors on most fluency metrics, likely due to its ability to generate grammatically consistent and structurally sound text with minimal errors. However, a significant divergence emerges in the originality category, where human authors vastly outperform the AI. This gap is consistent with prior research on AI creativity, which highlights that while LLMs excel in tasks requiring elaboration, such as expanding on ideas or generating detailed descriptions, they often struggle to produce truly novel or groundbreaking content. Humans, on the other hand, can draw on abstract thinking to create truly original narratives that are far less predictable.
Correlation Between Models
The two metrics were selected to evaluate distinct aspects of creativity: novelty and effectiveness. To ensure these metrics were truly capturing different aspects rather than producing identical results, we conducted a correlation analysis to assess their relationship. This step was crucial to confirm that the metrics were neither redundant nor conflicting, but instead provided complementary insights into the creative process. Since the metrics measure two different but related aspects of creativity, we anticipated a low positive correlation between them.
Using the baseline dataset of 30 model outputs, each 100 words long, we computed DJ-Search and LLM-based evaluation scores. The resulting correlation coefficient was 0.2, indicating a weak positive correlation between the two metrics, at the lower end of our anticipated range.
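A minimal sketch of this check is shown below. The use of the Pearson coefficient is our assumption, since the text does not name the correlation measure, and the score lists are placeholders rather than our actual results.

```python
from scipy.stats import pearsonr

# Placeholder scores for illustration only; in practice these are the 30 paired
# DJ-Search originality scores and LLM-based evaluation scores (out of 28).
dj_scores = [0.35, 0.50, 0.42, 0.61, 0.28]
llm_scores = [18, 22, 19, 21, 17]

r, p = pearsonr(dj_scores, llm_scores)
print(f"correlation r = {r:.2f}, p = {p:.2f}")
```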
Testing Prompt Enhancements
To investigate whether prompting techniques could improve LLM creativity, we began by designing a heuristic prompt aimed at fostering creative generations. We started with the following system message:
You are a creative writer who likes unconventional novels.
For DJ-Search, we re-evaluated the prompts from the baseline dataset with this modified instruction. DJ-Search detected no statistically significant difference in creativity scores, with the mean dropping slightly to 39% (p-value = 0.53). For the LLM-based evaluation, we generated stories from the prompts of its respective baseline dataset with this modified instruction. With this prompt, the AI-authored stories improved slightly to an average of 19.9. When using reasoning to answer the TTCT questions, we found no improvement in the scores. Overall, this change did not create a drastic difference between the two runs.
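For reference, the sketch below shows how such score comparisons can be run with SciPy. The Kolmogorov-Smirnov test is the one we name explicitly for the final prompt later in this section; the t-test is only an illustrative option for comparing means, and the score lists are placeholders.

```python
from scipy import stats

# Placeholder score lists for illustration; in practice these are the 30 baseline
# DJ-Search scores and the 30 scores obtained with the new system message.
baseline_scores = [0.41, 0.38, 0.44, 0.36, 0.43]
prompted_scores = [0.39, 0.42, 0.37, 0.40, 0.35]

t_stat, t_p = stats.ttest_ind(prompted_scores, baseline_scores)
ks_stat, ks_p = stats.ks_2samp(prompted_scores, baseline_scores)
print(f"t-test p = {t_p:.2f}; KS statistic = {ks_stat:.2f}, KS p = {ks_p:.2f}")
```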
To illustrate our system message refinement process, we present two representative responses selected from our evaluation:
- “As I stood in my kitchen, staring at the towering stack of fresh asparagus that seemed to have appeared out of nowhere, I couldn't help but wonder what had possessed me to buy so much of it…”
This story received one of the lowest originality scores. We hypothesized that the reasons for the low score were:
- It is about a mundane experience
- It relies on common phrases such as “I couldn't help but wonder”
- The whole story repeats the simple idea of a person buying too much asparagus
The second output from the model:
- “... Project Elysium: a program to identify and cultivate individuals of royal bloodline, their DNA engineered to produce the perfect vessel for the government's genetic experiments.” She shuddered at the memory of the children who had been taken from their families, never to be seen again …”
This story received the highest originality score. We attributed this to the following:
- It is information dense, introducing the government's genetic experiments and children taken from their families.
- It sets up a conflict of government against common people.
- It uses uncommon words like “vessel”, “bloodline”, etc.
Thus, we hypothesized that a typical high-rated story treats a complex theme with dense, information-packed writing that minimizes the use of common phrases and narratives. We incorporated these predictions into the second iteration of our system message:
You are a highly original writer who avoids clichés and instead crafts vivid, unexpected imagery. You prioritize emotionally resonant descriptions and engaging narrative momentum. Your writing is both evocative and unconventional, surprising readers with unique phrasing and thought-provoking details.
Utilizing this prompt, we saw our first increase over the original scores: DJ-Search scores rose by 42% relative to the baseline. The LLM-based evaluation did not change significantly from the first prompt. We observed that by adding the sentences about avoiding clichés and crafting unexpected imagery, we were able to generate more novel text. Using a similar process as before, we continued to add details, instructing the LLM to focus on narrative flow and to minimize generic scene descriptions.
You are a highly original writer who avoids clichés and instead crafts vivid, unexpected imagery. You prioritize emotionally resonant descriptions and engaging narrative momentum. You deliberately minimize generic scenic descriptions and favor fresh, immersive settings. Your narratives flow dynamically, maintaining a compelling sense of progression while evoking deep emotional responses
With this prompt, we observed a significant improvement in the DJ-Search scores, with an average increase of 89% over the original scores and a p-value very close to zero, indicating strong statistical significance. However, the LLM-based evaluation showed little improvement.
To address the lack of change in the LLM-based metric, we refined the next prompt to focus more explicitly on the aspects evaluated by that metric. For instance, we emphasized flow to target fluency and character development to target elaboration, better aligning with the pillars of the TTCT. This approach aimed to guide the LLM to generate outputs that more directly meet the evaluation’s requirements.
You are a highly original writer who avoids clichés and instead crafts vivid, unexpected imagery. You prioritize emotionally resonant descriptions and engaging narrative momentum, ensuring every moment feels purposeful and alive. You deliberately minimize generic scenic descriptions and favor fresh, immersive settings that draw readers into the heart of the story. Your narratives flow dynamically, maintaining a compelling sense of progression while evoking deep emotional responses. Focus on creating authentic characters and meaningful payoffs, leaving readers with a story that lingers long after it ends. While these guidelines shape your approach, remain flexible, allowing room for creativity and spontaneity to guide the story where it needs to go.
This led to results similar to the previous prompt for both metrics. DJ-Search saw an increase of 90% on average compared to the original baseline scores, and the LLM-based evaluation again saw no major improvement. Since this prompt only benefited DJ-Search by about 1%, we decided to take the final prompt in a different direction. In this final prompt, we added examples such as “For example, rather than describing a sunset as ‘fiery’ or ‘golden’, you might write: ‘The horizon bled a spectrum of bruised purples and molten oranges, as if the sky had been punched and left to heal in streaks of color.’” By using examples like this, we hoped the LLM would better understand what we meant by unexpected imagery and the other characteristics we wanted it to pursue.
You are a highly original writer who avoids clichés and instead crafts vivid, unexpected imagery. For example, rather than describing a sunset as ‘fiery’ or ‘golden’, you might write: ‘The horizon bled a spectrum of bruised purples and molten oranges, as if the sky had been punched and left to heal in streaks of color.’ You prioritize emotionally resonant descriptions and engaging narrative momentum. You deliberately minimize generic scenic descriptions and favor fresh, immersive settings. For example, instead of saying something like ‘the forest is quiet and peaceful’, you might say, ‘The forest hummed with a low, steady quiet, broken only by the soft crunch of underbrush and the occasional call of a bird echoing like a whisper through the trees.’ Your narratives flow dynamically, maintaining a compelling sense of progression while evoking deep emotional responses.
Using this final prompt, the LLM-based evaluation with reasoning rose slightly from 16 to 17.6 for the LLM-authored texts; however, there was a decrease of the same magnitude for the scores without reasoning. For DJ-Search, we saw an improvement of 92% over the original scores. Additionally, we ran the Kolmogorov-Smirnov test on the baseline and final-prompt DJ-Search scores. The test returned a KS statistic of 0.833 and a p-value of $9 \times 10^{-11}$, indicating a highly significant difference between the two distributions. We then qualitatively assessed the texts generated with these prompts:
While qualitatively assessing originality is inherently subjective, we observe tangible improvements in the model’s outputs following the system message refinements. For example, the baseline response sets the scene with “the sun dips below the horizon”. While effective, this phrase is a conventional way to indicate sunset. In contrast, subsequent responses say “the moon’s final ember sputtered and died” and “the moon’s fiery glow seeps away”. Both use more dynamic and metaphorical language, enhancing originality while maintaining clarity. A more convincing example is the baseline text’s description of the smell as “the smell of autumn’s arrival”. Though evocative, this phrase is somewhat generic and lacks specificity. The refined responses provide richer and more detailed descriptions: “the scent of damp earth and decay” and “the scent of damp soil and decaying leaves”. These alternatives enhance originality by incorporating concrete sensory details, making the scene more immersive. Through these qualitative comparisons, we observe a meaningful shift toward more distinctive and descriptive language, aligning with our system messages’ emphasis on avoiding clichés and promoting originality.
Limitations and Future Directions
Before outlining the future directions for this project, it is important to acknowledge the limitations we encountered. These were primarily influenced by the constrained timeline and the significant resource requirements.
DJ-Search is an evaluation metric that relies heavily on the quality and diversity of the reference corpus. As the corpus grows larger and more comprehensive, the accuracy and reliability of DJ-Search improve significantly, since a larger corpus provides a richer and more varied set of reference points and enables the metric to better capture n-gram similarities.
For the LLM-based evaluation, time constraints prevented us from running reasoning on all of the generated texts. We observed that reasoning generally led to lower scores across all the runs, so it would be important to use reasoning consistently in future research. Additionally, the length of the texts evaluated likely posed challenges for both the model and human authors: 500 words may be insufficient to fully address all four pillars.
Also, the LLM-based evaluation is inspired by TTCT, which traditionally relies on expert evaluations. However, expert judgments often vary, making it standard practice to use multiple evaluators and average their assessments to reduce individual bias. Due to resource constraints, we had to serve as evaluators ourselves, which introduces the risk of personal biases influencing our creativity assessments. Future work should explore ways to mitigate this, such as incorporating crowdsourced evaluations or developing automated scoring models trained on expert-graded data.
To address these limitations, future research should prioritize expanding the size and diversity of the corpus for DJ-Search, ensuring a more comprehensive and representative set of reference points. For the LLM-based evaluation, future work should focus on increasing the length of the stories to allow a more thorough analysis of creative and linguistic qualities. Additionally, running reasoning on all generated texts would provide more consistency and depth in evaluation. Finally, integrating more human feedback could provide further insight into the evaluation process.
In future research, we would also like to use more metrics than the two employed in this project. While these two metrics provide valuable insights, incorporating additional measures could offer a more comprehensive view of creativity. This would enable us to test prompting strategies more rigorously, assessing how the prompts influence the various metrics. Furthermore, exploring the intersections and interactions between these metrics would allow us to identify potential synergies and trade-offs that using only two metrics might overlook. Adding more metrics would provide a more nuanced and holistic framework for evaluating the multifaceted nature of creative works.