Lab 4 ‐ Large Language Models

Introduction

Continuing the LSP and IDE development, you will implement LLM-based intelligent editing services for the .dataspace language. First, we will create a chat application that can answer questions about dataspace models and help modify them. Later, we will create a workflow that leverages multiple LLMs to turn the BME rules for Student Feedback on Teaching (OHV) into a dataspace model.

The expected learning outcomes of this session are the following.

  • 🎯 You should have sufficient skill to implement a chat application that combines LLMs with classical source code intelligence services (validation).
  • 🎯 You should have sufficient skill to build and orchestrate LLM workflows with data extraction and structured output.

Notation

Tip

The guide is annotated using the following notation.

Regular text: Information related to the laboratory.

💡 Extra insights about the current topic.

✏️ You should perform some specific exercise.

📋 CHECK: You should check something and write down your experience.

🔍 You should do some research to continue.

Suggested Reading

Starter Project

You may find the starter project in the lab-4 branch of the lab repository.

Tasks

Task 0: Project Onboarding (0 points)

For this session, we added 3 new commands to the dataspace CLI application: chat, extract, and generate. You may find the implementation of these commands in the src/cli/ai directory.

✏️ Install the dependencies of the project with npm install.

💡 Large Language Models (LLMs) provide REST-based APIs to access them programmatically and generate chat responses. In contrast with the browser-based LLM chat interfaces, this lets us integrate them with other tools and build intelligent workflows and agents. However, each API call is usually billed according to the number of input (prompt) and output (response) tokens. As a rule of thumb, a token is about 0.75 English words.

In this lab, we will use OpenRouter to access LLM APIs. OpenRouter is a proxy service that lets us access LLMs from multiple vendors (e.g., OpenAI, Anthropic, Google, DeepSeek) and apply billing controls.

To access the API (and let OpenRouter bill our usage), you will need an API key.

Warning

The API key allows initiating requests that cost actual money. Always handle the API key securely and never expose it (e.g., by committing it to GitHub). If your API key gets compromised, revoke it and generate a new one immediately.

✏️ Open the course page on Moodle and generate a new API key for yourself. Then save the API key into an environment variable in your terminal.

You can set an environment variable in Linux-like environments (including macOS, WSL, and Git Bash) as follows:

export OPENAI_API_KEY=<your api key here>

You can set an environment variable in PowerShell as follows:

$Env:OPENAI_API_KEY=<your api key here>

You can set an environment variable in cmd.exe as follows:

set OPENAI_API_KEY=<your api key here>
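The CLI commands pick up the key from this variable. As a rough sketch (the exact setup in the starter code may differ), an OpenAI-compatible client pointed at OpenRouter can be configured like this:

import OpenAI from 'openai';

// OpenRouter exposes an OpenAI-compatible endpoint, so the official client can
// be used by overriding the base URL. The key comes from the environment
// variable set above; never hard-code it in the source.
const client = new OpenAI({
    baseURL: 'https://openrouter.ai/api/v1',
    apiKey: process.env.OPENAI_API_KEY,
});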

Warning

The API key will not be shown again, so make sure that you have set the environment variable. You will have to run any commands that access the LLM API in the same terminal where you have set the variable.

💡 Per-token pricing of most LLM providers is in the range of $0.2-$20 per 1 million tokens. As we will process much less than 1 million tokens, an API key with a $1 limit should be sufficient for completing the exercises in this lab.

✏️ Build the application with npm run build, then start the chat application with node dist/cli/main.cjs chat dataspace-example/students.dataspace.

📋 CHECK: The application should display an LLM-generated summary of the dataspace-example/students.dataspace model.

Use /exit to exit the application.

Task 1: Chat Application (5 points)

💡 In this task, we will create a chat application that lets you chat with and modify dataspace models. You can find a skeleton implementation of the chat application in the src/cli/ai/chat.ts file.

The chat application uses a relatively lightweight and fast LLM, openai/gpt-4o-mini, but you can try similar LLMs like google/gemini-2.0-flash-001 or even deepseek/deepseek-chat.

🔍 Open the src/cli/ai/chat.ts file and study how it communicates with the LLM using the openai client package.

The interaction with the LLM is based on 3 main types of messages:

  • system messages give general instructions to the LLM about how to behave. They can be simple for general-purpose assistants (e.g., You are a helpful assistant.), or detailed for specialized agents. Usually, only a single system prompt is provided, and it is the first message in the input context.
  • user messages are inputs from the user. More than one user message may appear in the input context to let the LLM access the whole chat history.
  • assistant messages are the responses from the LLM. You can also add assistant messages to the input context as part of the chat history. The output of the LLM is always a single assistant message.

Certain LLMs may support other message types, for example, to enable tool calling.
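As an illustration (the concrete prompts in chat.ts differ), a single chat turn with the client configured earlier could be assembled like this:

import OpenAI from 'openai';

const messages: OpenAI.Chat.ChatCompletionMessageParam[] = [
    // General behavior instructions; always the first message.
    { role: 'system', content: 'You are an assistant for the dataspace modeling language.' },
    // Earlier turns of the conversation can be replayed as history.
    { role: 'user', content: 'What is a stakeholder?' },
    { role: 'assistant', content: 'A stakeholder owns, consumes, or processes datasets.' },
    // The newest user question comes last.
    { role: 'user', content: 'Who can access the patients dataset?' },
];

const response = await client.chat.completions.create({
    model: 'openai/gpt-4o-mini',
    messages,
});
console.log(response.choices[0].message.content);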

✏️ Complete the system prompt in src/cli/ai/system-prompt.txt by describing the rest of the data space language.

Tip

To avoid cross-contamination of information between the system prompt and the user input, it is helpful to use a different example than the dataspace-example/students.dataspace model. For example, you can use the following example from the medical domain to illustrate dataspace concepts to the LLM:

schema MedicalRecord {
    @PII patientId: string
    age: number
    deceased: boolean
}

stakeholder Hospital {
    owns dataset patients: PatientData
}

stakeholder Patient {
    subject of Hospital.patients {
        gives consent to ResearchInstitute
    }
}

stakeholder ResearchInstitute {
    provides service generateStats: PatientData -> MedicalRecord
}

stakeholder PublicHealthAgency {
    consumes dataset healthReports: PublicHealthStats
}

service chain GenerateEpidemiologicalReport {
    first Hospital.patients
    then ResearchInstitute.generateStats
    then PublicHealthAgency.healthReports {
        patientId <- pseudonymize(patientId)
        patientAge <- age
        deceased <- deceased
    }
}

✏️ Implement the rest of the conversation loop in src/cli/ai/chat.ts. You should pass the user input to the LLM and print its answer.

Tip

Always record the user input and the LLM output in the messages array so that new questions have access to previous questions and answers as part of the context.
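A minimal sketch of such a loop, assuming a hypothetical readUserInput helper and the client and messages set up by the skeleton:

// Hypothetical loop body; `client`, `messages`, and `readUserInput` are assumed
// to be provided by the surrounding skeleton in chat.ts.
for (;;) {
    const input = await readUserInput('> ');
    if (input === '/exit') {
        break;
    }
    // Record the user input so later turns see the full history.
    messages.push({ role: 'user', content: input });
    const response = await client.chat.completions.create({
        model: 'openai/gpt-4o-mini',
        messages,
    });
    const answer = response.choices[0].message.content ?? '';
    console.log(answer);
    // Record the answer as well, for the same reason.
    messages.push({ role: 'assistant', content: answer });
}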

📋 CHECK: After compiling with npm run build, run the chat application on the dataspace-example/students.dataspace model. Ask some simple questions (e.g., Who has access to private student information?), and save the transcript into your repository.

Now we will integrate the Langium-based validation for the dataspace language with the chat application. This significantly extends the capabilities of the chat application compared to the LLM alone: by running the validator, we can make sure that code generated by the LLM is syntactically and semantically valid.

✏️ Modify the second part of the prompt (beginning with In the following, the user will ask a series of questions.) so that the LLM repeats the full, modified code between <code> and </code> tags when the user asks for a code modification.

Tip

It is useful to also specify that the LLM should omit any other formatting (such as Markdown tags) inside the generated code.

💡 The two approaches for generating code modifications with LLMs are repeating the full, modified code (as in our exercise) or generating patches in the unified diff format. While repeating the code may introduce unwanted changes in other parts (such as omitting some code), applying an LLM-generated diff is challenging, because most LLM-generated patches are invalid and require a degree of fuzzy parsing and matching.

✏️ Write a helper function (e.g., getCode(content: string): string | undefined) to extract the generated code from the LLM response.
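A possible sketch of such a helper, assuming the response contains at most one <code>…</code> pair:

function getCode(content: string): string | undefined {
    // Match the first <code> ... </code> pair (non-greedy, across multiple lines).
    const match = /<code>([\s\S]*?)<\/code>/.exec(content);
    return match?.[1].trim();
}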

✏️ If the LLM response contains generated code, run the Langium validator on it. Print the validation errors, if any, to the console.

You can instantiate the Langium services for your code as follows:

import { LangiumDocument, URI } from 'langium';
import { NodeFileSystem } from 'langium/node';

import { Model } from '../../../gen/language/generated/ast.js';
import { createDataSpaceServices } from '../../language/data-space-module.js';
// Example code to extract `errorDiagnostics` from `code`.
const services = createDataSpaceServices(NodeFileSystem).DataSpace;
const document = services.shared.workspace.LangiumDocuments.createDocument(
    // We must provide a URI for the Langium document.
    // Since this document is not saved to the file system, we provide a fake URI.
    URI.parse('synthetic://model.dataspace'),
    code,
) as LangiumDocument<Model>;
await services.shared.workspace.DocumentBuilder.build([document], { validation: true });
const errorDiagnostics = document.diagnostics?.filter(e => e.severity === 1) ?? [];
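The resulting errorDiagnostics are standard LSP Diagnostic objects, so a simple way to print them is:

for (const diagnostic of errorDiagnostics) {
    // Diagnostic positions are zero-based, hence the +1 for human-readable output.
    console.log(`Line ${diagnostic.range.start.line + 1}: ${diagnostic.message}`);
}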

✏️ Whenever the user enters /fix into the chat, send a prompt instructing the LLM to fix the errors in the model.

Tip

Look at how the /exit command is implemented. Note that the generated fix also contains code, which should be validated.
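One possible way to phrase the fix request, assuming the diagnostics of the last generated code are kept in a (hypothetical) lastErrors variable:

if (input === '/fix') {
    const errorList = lastErrors
        .map(d => `Line ${d.range.start.line + 1}: ${d.message}`)
        .join('\n');
    messages.push({
        role: 'user',
        content: `The generated model has the following validation errors:\n${errorList}\n` +
            'Please fix them and repeat the full corrected code between <code> and </code> tags.',
    });
    // Continue with the same LLM call as for regular user input;
    // the response contains code again, so it should also be validated.
}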

📋 CHECK: After compiling with npm run build, run the chat application on the dataspace-example/students.dataspace model. Ask for a simple modification (e.g., Create a new schema for student data aggregation that also includes an AI score?). Ask for fixes with /fix until you get a valid model. Save the transcript into your repository.

How "intellingent" is the LLM after all? Can it be a reasonable help for dataspace engineering? How could we improve its capabilitis? What are the common failure modes?

Code generation workflow

More complex LLM-based tools can be created by LLM orchestration, where multiple LLMs and multiple LLM calls are combined with other tools (such as validators). There are two main approaches to orchestration:

  • Workflows are chains of pre-determined calls to LLMs.
  • Agents use the LLM itself to drive the execution by prompting it to return the next step to be executed.

Agents can solve more complex tasks, but they can be brittle, because they rely on the LLM to determine the order in which steps are executed.

In the following, we will create a workflow for generating dataspace models from textual specifications. The following diagram illustrates the workflow.

graph TD
    A[PDF Extraction] --> B
    
    subgraph Reasoning_LLM[Reasoning LLM]
        B[Specification Summarization]
    end
    
    subgraph Coding_LLM[Coding LLM]
        C[Schema Generation]
        D[Stakeholder Generation]
        E[Service Chain Generation]
        F[Mapping Generation]
    end
    
    B --> C
    C --> D
    D --> E
    E --> F
    F --> G[Generated Code]

We will extract the specification from PDF files. Then, we will use a reasoning LLM with a large context window and powerful inference capabilities to summarize the specification. The summary will serve as input for a cheaper and faster coding LLM that generates the different parts of the dataspace model.

We split the reasoning and coding parts of the workflow into two separate commands so that you do not have to wait for the reasoning LLM every time you run the code generation step.

Task 2: Specification summarization (2 points)

🔍 Open the src/cli/ai/extract.ts file and study how we would like to extract a written specification from PDF files.

The role of this program is to take all the PDF files in the dirName directory, and create a short natural language (textual) summary, which is saved to the outPath file.

The chat application uses a "reasoning" LLM, deepseek/deepseek-r1, which generates and discards many "reasoning" tokens during output generation to further process the input. You can try similar LLMs like openai/o1-mini or anthropic/claude-3.7-sonnet:thinking.

💡 Recently, many libraries for LLM workflow and agent orchestration have become available, such as LangChain, LangGraph, AutoGen, and smolagents. For simplicity, we will use manual orchestration and not rely on any library, as is recommended for simple applications. Nevertheless, we will use the PDF extraction capabilities of LangChain to read the information from the OHV regulations.

💡 The LLM that we use has a large enough context window so that the full OHV regulations fit. For even larger documents, orchestration frameworks employ techniques like Retrieval-Augmented Generation (RAG) to let the LLM select the parts of the documents that should be added to the context window.

🔍 Study the LangChain PDFLoader documentation, especially the parts on reading PDF files from a directory.

✏️ Implement the PDF file extraction from the dirName directory in the src/cli/ai/extract.ts file. Concatenate all PDF files into a single prompt.

Tip

By default, LangChain will split PDF files into separate pages. You can disable this by passing the { splitPages: false } option.
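A sketch of the extraction step; the exact import paths depend on the LangChain packages used in the starter project:

import { DirectoryLoader } from 'langchain/document_loaders/fs/directory';
// In newer LangChain versions, PDFLoader lives in @langchain/community.
import { PDFLoader } from '@langchain/community/document_loaders/fs/pdf';

// Load every PDF in the directory as a single document (no per-page splitting).
const loader = new DirectoryLoader(dirName, {
    '.pdf': (path: string) => new PDFLoader(path, { splitPages: false }),
});
const documents = await loader.load();
// Concatenate the extracted text of all PDFs into a single prompt.
const specificationText = documents.map(doc => doc.pageContent).join('\n\n');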

✏️ Complete the system prompt with a description of the main concepts in the dataspace modeling language. Do not include any code examples, as the role of this LLM in the workflow is to generate a concise English summary of the input instead of generating code.

We will use the generated summary in Task 3 with a different LLM to generate syntactically valid code.

📋 CHECK: After compiling with npm run build, run the application with node dist/cli/main.cjs extract specification extracted.txt to extract a summary from the PDF files in the specification directory. Save the resulting extracted.txt into your repository.

What do you think about the specification? Does it look reasonable? How could we make the LLM improve it? Is this a viable approach for summarizing a large amount of documentation?

Task 3: Code generation with structured output (3 points)

Structured output capabilities of LLMs allow us to pass a JSON schema to constrain the generated output to specific kinds of JSON files. Since we can parse the resulting JSON, this allows us to generate arbitrary structured data with LLMs.

In this exercise, we will use structured output to ensure that the LLM generates output that conforms to the abstract syntax of our dataspace modeling language. Since we want to use textual concrete syntax, we also implement a code generator that turns the abstract syntax into the concrete one (as no such tool is generated by Langium out of the box).

The code generation command uses a relatively lightweight and fast LLM, openai/gpt-4o-mini, but you can try switching to a dedicated "coding" LLM like mistralai/codestral-2501 or a bigger LLM like anthropic/claude-3.7-sonnet. Do not use a "reasoning" LLM, as it would slow down this step of the workflow too much.

This step uses the same system prompt as the chat application from Task 1, as the LLM will need the complete description of the dataspace language to generate code.

Instead of writing JSON schemas directly, we will use a TypeScript library called zod. The OpenAI client library has a helper function to turn zod schemas into JSON schemas, but zod schemas also come with TypeScript definitions to make interacting with them more pleasant.

🔍 Open the src/cli/ai/generate.ts file and study how we use zod schemas to generate schemas and stakeholders from the natural language summary of the dataspace specification.

🔍 Study the zod documentation on defining your own schemas.

💡 We prompt the LLM to only use valid TypeScript identifiers in the names of data space schemas and stakeholders. However, it may still sometimes generate an invalid identifier. To remedy this, we use the transform operator of zod to remove invalid characters with a custom sanitizeName helper function.

✏️ Define a new zod schema for all the service chains in the dataspace. At this step, do not include any mappings in the schema.

✏️ Also write the corresponding prompt, and make the LLM generate the service chains with structured output. Print the resulting JSON to the console.
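A sketch of what this could look like, assuming the messages array already built in generate.ts; the schema fields below are only illustrative and should be adapted to the actual abstract syntax of service chains (a first step followed by then steps):

import { z } from 'zod';
import { zodResponseFormat } from 'openai/helpers/zod';

// Illustrative shape only: a service chain is a named sequence of steps,
// each referring to a stakeholder and one of its datasets or services.
const ServiceChainsSchema = z.object({
    serviceChains: z.array(z.object({
        name: z.string(),
        steps: z.array(z.object({
            stakeholder: z.string(),
            element: z.string(),
        })),
    })),
});

// Depending on the openai client version, the parse helper may live under client.beta.
const completion = await client.beta.chat.completions.parse({
    model: 'openai/gpt-4o-mini',
    messages,
    response_format: zodResponseFormat(ServiceChainsSchema, 'service_chains'),
});
const serviceChains = completion.choices[0].message.parsed;
console.log(JSON.stringify(serviceChains, null, 2));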

📋 CHECK: After compiling with npm run build, run the application with node dist/cli/main.cjs generate extracted.txt to print the service chains (without mappings) in JSON format to the console. Save the output to your documentation.

Do the generated service chains look reasonable? How could we use the structured output from the previous workflow steps (schemas, stakeholders) to improve them?

✏️ Define a new zod schema for a mapping between two service chain schemas. Make sure to include the possibility of pseudonymization.

✏️ Also write the corresponding prompt, and create a helper function that generates a mapping between two service chain schemas with the LLM.

✏️ Iterate over the service chains generated by the LLM, and use your helper function to generate a mapping whenever the actual and expected input schemas differ. Print the generated mappings in JSON format to the console.

At this step, you can ignore situations where the actual and expected input schemas of a service chain step match, but we have to create a schema mapping to pseudonymize PII.

Tip

This is where structured output becomes especially useful, as we can use the structured output from the stakeholder generation to determine the input and output schemas of service chain steps.
You may run into a situation where the LLM refers to a non-existing (hallucinated) stakeholder or service. Here, we could ask the LLM to correct itself, but for simplicity, we will just avoid generating any mapping for such situations.

You may see that mapping generation does not need all previously generated mappings in its context to function. Therefore, you can use the context from service chain generation and extend it with a single user message every time, which enables parallelizing mapping generation requests, as sketched below.
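A sketch with hypothetical baseMessages, stepsNeedingMapping, and generateMapping names:

// `baseMessages` is the context after service chain generation; each request
// extends it with one extra user message, so the requests are independent
// and can run concurrently.
const mappings = await Promise.all(
    stepsNeedingMapping.map(step =>
        generateMapping([...baseMessages], step.actualSchema, step.expectedSchema),
    ),
);
console.log(JSON.stringify(mappings, null, 2));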

📋 CHECK: After compiling with npm run build, run the application with node dist/cli/main.cjs generate extracted.txt to print the service chains and mappings in JSON format to the console. Save the output to your documentation.

Do the generated mappings look reasonable? How could we use the structured output from the previous workflow steps to improve them? Is this a viable approach for turning a specification into a formal model?

Extra task: Agent orchestration (3 IMSc points)

So far, the generation of dataspace models was completely driven by our fixed workflow, with the structured output guaranteeing syntactic correctness.

The aim of this task is to create an agent, where the LLM orchestrates the generation steps.

Instead of using structured output, we will make the LLM output code (textual concrete syntax) directly. We will give the LLM access to a tool that invokes the Langium dataspace validator and reports back the validation errors.

🔍 Study the function calling documentation of the OpenAI client library. You can look at further examples in the OpenRouter documentation and the LangChain documentation.

✏️ Construct a tool that invokes the dataspace validator. A very simple input schema will do, such as passing the whole source code as a single string attribute of a JSON object (but schemas must be objects for structured output and tool calling to work).
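With the OpenAI chat completions API, such a tool could be declared roughly as follows (the tool name and description are placeholders):

import OpenAI from 'openai';

const tools: OpenAI.Chat.ChatCompletionTool[] = [
    {
        type: 'function',
        function: {
            name: 'validate_dataspace_model',
            description: 'Validates dataspace source code and returns the list of errors.',
            parameters: {
                type: 'object',
                properties: {
                    code: {
                        type: 'string',
                        description: 'The full dataspace model source code.',
                    },
                },
                required: ['code'],
            },
        },
    },
];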

✏️ Write a prompt that instructs the agent to generate a valid dataspace model from a textual specification by invoking the validator as needed.
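A sketch of the agent loop, assuming a hypothetical runValidator helper that wraps the Langium-based validation from Task 1 and returns the error messages as strings:

for (;;) {
    const response = await client.chat.completions.create({
        model: 'anthropic/claude-3.7-sonnet',
        messages,
        tools,
    });
    const message = response.choices[0].message;
    messages.push(message);
    // Print every interaction so you can follow the agent's progress.
    console.log(message.content ?? '(tool call)');
    if (!message.tool_calls?.length) {
        // No more tool calls: the agent considers the model finished.
        break;
    }
    for (const toolCall of message.tool_calls) {
        const { code } = JSON.parse(toolCall.function.arguments);
        const errors = await runValidator(code);
        console.log(errors);
        // Report the validation result back to the agent.
        messages.push({
            role: 'tool',
            tool_call_id: toolCall.id,
            content: errors.length > 0 ? errors.join('\n') : 'The model is valid.',
        });
    }
}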

Tip

You may need to use a bigger LLM like anthropic/claude-3.7-sonnet for optimal performance.

Warning

Make sure to print all interactions to the console so you can see whether the LLM has gotten stuck, burning tokens without making any progress. In this case, terminate it and try again.

📋 CHECK: Run the agent to generate a dataspace model from the extracted.txt specification. Save the resulting transcript to your documentation.

How effective is the agent at using the tools? Is this a viable approach for code generation?
