
Including Code Context in LLM Rubric Generation

Grading rubrics generated with a large language model (LLM) are typically based only on a high-level topic or problem description. An open question is whether providing the actual code base during rubric creation (not just afterward, for grading) leads to more relevant and nuanced rubrics. Below, we explore scenarios where code context adds value, potential improvements in rubric richness, trade-offs to consider, and concrete examples illustrating the impact.

When Code Context Adds Value to Rubric Creation

  • Grounding in Specifics: Including the code base can anchor the rubric in the project’s real context, leading to more targeted evaluation criteria. For instance, if the code shows it’s a React app, the rubric might include framework-specific checks (like component structure or state management). Without the code, the LLM might only generate broad criteria like “functionality” or “documentation.”
  • Precision in Categories and Descriptors: Code context helps the LLM spot details (e.g., concurrency or security concerns) that a brief topic description would miss. This avoids irrelevant criteria—for instance, a “performance” category for a trivial code base with no heavy computations.

Richer and More Contextually Relevant Rubrics

  • Deeper Reasoning: Seeing real code prompts the LLM to perform a mini code review before creating the rubric, often producing more nuanced distinctions. For example, it might add a row for “Comprehensive Edge-Case Handling” vs. “Basic Handling,” which a generic rubric might overlook.
  • Avoiding Generic Boilerplate: With actual code, the model can identify specific patterns (e.g., complex algorithms, extensive logging) and create categories accordingly—like “Time/Memory Complexity” or “Logging Quality.” This fosters a more comprehensive and precise rubric.

Trade-offs and Potential Downsides

  • Overfitting: A rubric overly tailored to one code base can lose generality. In an educational setting, it might unfairly bias grading if criteria revolve around a single student’s approach.
  • Omitting Hidden Concerns: If the code base lacks tests entirely, the LLM might fail to include a “Testing” category—even though testing should typically be evaluated.
  • Practical Considerations: Large code bases can strain the model’s context window, leading to slower responses or “lost” details. The rubric might also become cluttered with minor specifics or internal references.
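One way to mitigate the context-window concern is to trim the code base to a token budget before pasting it into the rubric prompt. The sketch below is a minimal, hypothetical approach: the ~4-characters-per-token estimate, the 50,000-token default budget, the file-extension list, and the smallest-files-first ordering are all illustrative assumptions, not part of any ContextCraft API.

```python
from pathlib import Path

# Rough heuristic: ~4 characters per token for typical source code (assumption).
CHARS_PER_TOKEN = 4

def collect_codebase(root: str, budget_tokens: int = 50_000,
                     extensions: tuple[str, ...] = (".py", ".ts", ".tsx", ".js")) -> str:
    """Concatenate source files under `root` until an approximate token budget is reached.

    Files are taken smallest-first so that many representative modules fit before
    any single large file exhausts the budget. Files that would overflow the budget
    are listed by name only, so the model still knows they exist.
    """
    budget_chars = budget_tokens * CHARS_PER_TOKEN
    parts: list[str] = []
    used = 0

    files = sorted(
        (p for p in Path(root).rglob("*") if p.is_file() and p.suffix in extensions),
        key=lambda p: p.stat().st_size,
    )
    for path in files:
        text = path.read_text(encoding="utf-8", errors="ignore")
        header = f"\n--- {path} ---\n"
        if used + len(header) + len(text) > budget_chars:
            parts.append(f"\n--- {path} (contents omitted: token budget exceeded) ---\n")
            continue
        parts.append(header + text)
        used += len(header) + len(text)

    return "".join(parts)
```

The result can be substituted directly for the <CODEBASE> placeholder in the prompt below; a summarization pass over the omitted files would be a natural extension.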

Examples and Edge Cases

  1. Performance in an ML Repo

    • A generic rubric might say “Performance and Efficiency.”
    • With code context, the LLM might specify “Efficient Use of Vectorized Ops,” “GPU Acceleration,” or “Batching for Large Datasets.”
  2. Accessibility in a React Web App

    • Without code, a rubric might only say “Accessibility: Good/Poor.”
    • Code-based insights enable criteria like “Semantics/ARIA Roles,” “Keyboard Navigation,” or “Screenreader-Friendly Routing.”
  3. Test Coverage in a Legacy Codebase

    • The topic alone might yield “Maintainability” or “Documentation.”
    • Seeing the code, the LLM could add “Unit Tests and Assertions” or “Refactoring Deprecated APIs” as specific rubric categories.

Conclusion

Including the actual code base in rubric creation can yield more detailed and contextually relevant rubrics, as the model effectively performs a deeper analysis before drafting criteria. However, this approach risks overfitting to one project and missing higher-level considerations. It’s most beneficial when you’re focused on a specific codebase’s quality or dealing with domain-specific details (e.g., performance nuances, framework usage).

Ultimately, the choice depends on whether the goal is highly specialized evaluation for a particular project or more general usage across multiple implementations. As LLMs become more capable with larger context windows, we’ll likely see more robust, code-informed rubrics—balancing specificity with reusability for fair and effective evaluations.

Below is a modified version of the rubric-creation prompt. It incorporates references to a code base (<CODEBASE>) while also cautioning against overfitting the rubric exclusively to that particular code. The goal is to retain a general, topic-based structure while allowing the model to draw on the code base for specificity and relevance as needed.


Modified Rubric Creation Prompt

You are an expert in creating detailed and effective rubrics. Your goal is to construct a robust rubric with exactly 8 categories and A-F rating levels for the topic: <TOPIC>. You also have access to the following code base for context and reference: <CODEBASE>. Use insights from the code base to inform and enrich the rubric, but ensure the final rubric remains broadly applicable to <TOPIC>, not overly tailored to the specifics of <CODEBASE>. The final output MUST be a markdown table representing the complete rubric.

To create the best possible rubric, follow these steps:

  1. Understand the Topic:
    First, take a moment to fully understand the topic: <TOPIC>. Consider its key components, aspects, and criteria for evaluation.

  2. Examine the Code Base (If Relevant):
    Review <CODEBASE> to gain concrete examples or patterns that might inform your categories or grade descriptors. Look for notable features, common pitfalls, or unique aspects. However, maintain a balance: use the code to inspire more specific or relevant criteria without making the rubric so specialized that it cannot be applied to other projects under the same topic.

  3. Brainstorm Core Categories:
    Think about the most important dimensions or categories for evaluating <TOPIC>. Aim for a comprehensive set of categories that cover all essential aspects. You may derive extra insight from the code base if it highlights key concerns (e.g., testing, security, architecture), but do not neglect broader best practices that might not appear in <CODEBASE>.

  4. Select and Refine 8 Categories:
    From your brainstormed list, carefully select the 8 most critical and distinct categories. Refine the names of these categories to be clear, concise, and user-friendly. Each category should represent a key area of evaluation for <TOPIC>.

    • Note: If the code base reveals significant issues or exemplary techniques in certain areas (e.g., performance, documentation), you may include corresponding categories. Just ensure these categories are still relevant to <TOPIC> in a general sense.
  5. Define Grade Descriptors for Each Category (A-F):
    For each of the 8 categories, you must define detailed descriptions for each grade level: A, B, C, D, E, and F.

    • Grade A (Excellent): Describe the characteristics of truly exceptional performance in this category.
    • Grade B (Good): Describe solid, above-average performance.
    • Grade C (Fair): Describe satisfactory or average performance.
    • Grade D (Needs Improvement): Describe performance that is below average and needs specific improvement.
    • Grade E (Poor): Describe significantly deficient performance.
    • Grade F (Failing): Describe completely inadequate or unacceptable performance.

    Ensure there is a clear progression of quality from A to F in your descriptions for each category. Whenever appropriate, you may reference themes discovered in <CODEBASE> (for instance, a security vulnerability or an especially efficient approach). However, avoid adding code-base-specific language that would not apply to other projects in the same domain.

  6. Format as a Markdown Table:
    Present the complete rubric as a markdown table with the following structure:

    ```markdown

    | Category | Grade A | Grade B | Grade C | Grade D | Grade E | Grade F |
    | --- | --- | --- | --- | --- | --- | --- |
    | Category 1 Name | Description for Grade A | Description for Grade B | Description for Grade C | Description for Grade D | Description for Grade E | Description for Grade F |
    | Category 2 Name | Description for Grade A | Description for Grade B | Description for Grade C | Description for Grade D | Description for Grade E | Description for Grade F |
    | Category 3 Name | Description for Grade A | Description for Grade B | Description for Grade C | Description for Grade D | Description for Grade E | Description for Grade F |
    | Category 4 Name | Description for Grade A | Description for Grade B | Description for Grade C | Description for Grade D | Description for Grade E | Description for Grade F |
    | Category 5 Name | Description for Grade A | Description for Grade B | Description for Grade C | Description for Grade D | Description for Grade E | Description for Grade F |
    | Category 6 Name | Description for Grade A | Description for Grade B | Description for Grade C | Description for Grade D | Description for Grade E | Description for Grade F |
    | Category 7 Name | Description for Grade A | Description for Grade B | Description for Grade C | Description for Grade D | Description for Grade E | Description for Grade F |
    | Category 8 Name | Description for Grade A | Description for Grade B | Description for Grade C | Description for Grade D | Description for Grade E | Description for Grade F |
    ```

    Replace "Category X Name" with the name of each of your 8 categories, and fill in the "Description for Grade X" cells with the corresponding descriptions you created in the previous step.

Example (Row Only): If the topic was "Evaluating a Business Plan," one row of your markdown table might look like this:

```markdown
| Market Analysis | Comprehensive market analysis with strong evidence and clear understanding of market dynamics. | Solid market analysis with good understanding of the target market and competitive landscape. | Adequate market analysis demonstrating basic understanding. | Market analysis is present but weak or superficial. | Market analysis is significantly flawed or incomplete. | Market analysis is missing or fundamentally flawed. |
```


Final Instruction

Now, generate the complete markdown table rubric for the topic <TOPIC> (replace with your topic), referencing <CODEBASE> only as needed for additional clarity or examples. Your rubric should remain broadly applicable to <TOPIC> while also reflecting any insights from <CODEBASE> that are valuable for guiding evaluations or improvements.
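In practice, the prompt above is filled in programmatically and sent to a chat model. The sketch below shows one way this could look, assuming the OpenAI Python SDK and a "gpt-4o" model name; any chat-capable client would work, and the template shown is an abbreviated stand-in for the full prompt above. `collect_codebase` refers to the earlier token-budget sketch and is likewise hypothetical.

```python
from openai import OpenAI  # assumption: OpenAI Python SDK; substitute your preferred client

# Abbreviated stand-in for the full rubric-creation prompt above.
RUBRIC_PROMPT_TEMPLATE = """You are an expert in creating detailed and effective rubrics.
Construct a robust rubric with exactly 8 categories and A-F rating levels for the topic: {topic}.
You also have access to the following code base for context and reference:

{codebase}

Use insights from the code base to inform the rubric, but keep it broadly applicable to the topic.
The final output MUST be a markdown table representing the complete rubric."""

def generate_rubric(topic: str, codebase: str, model: str = "gpt-4o") -> str:
    prompt = RUBRIC_PROMPT_TEMPLATE.format(topic=topic, codebase=codebase)
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    rubric = response.choices[0].message.content or ""

    # Light sanity check: a complete rubric should contain a header row,
    # a separator row, and 8 category rows.
    table_rows = [line for line in rubric.splitlines() if line.strip().startswith("|")]
    if len(table_rows) < 10:
        raise ValueError("Model output does not look like a complete 8-category rubric table")
    return rubric

# Example usage (hypothetical topic and project path):
# rubric = generate_rubric("Evaluating a Flask REST API", collect_codebase("./my-project"))
```

Keeping the sanity check loose (counting table rows rather than parsing cells) avoids rejecting otherwise valid rubrics over minor formatting differences; stricter validation could verify the A-F column headers as well.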

Resources

Additional resources: LLM Rubric Generation Research
