Fine‐tuning LLMs for Complex Workflows via AppWorld - minalee-research/cs257-students GitHub Wiki
#application, #task, #tool-calling
Preston Thiele
Abstract
Large language models (LLMs) hold promise for automating complex, real-world tasks via APIs. In my project, I evaluate how well the Llama-3.1-8B-Instruct model performs on the AppWorld benchmark, which simulates real applications and provides a suite of tasks with unit tests for evaluating model performance. I use a ReAct (Reasoning and Action) agent framework and explore how LoRA (Low-Rank Adaptation) fine-tuning on a dataset of outputs distilled from the larger Llama-3.3-70B-Instruct model improves agent performance on the AppWorld benchmark. I also experiment with zero-shot and one-shot prompting of the base Llama model to see how prompting alone affects performance. My findings suggest that prompting techniques alone provide a negligible performance increase on AppWorld tasks. However, fine-tuning on synthetic data distilled from the larger Llama model substantially improves performance, reducing syntactic errors and hallucinations in the model's code outputs.
What this project is about
The emergence of LLMs has led to the exploration of ways that tasks can be solved via “agentic” frameworks. Some of these frameworks, such as OpenAI’s Operator and Anthropic's Computer Use, rely on automating browser interactions. They provide LLMs the ability to click through interfaces and webpages similar to how a human would. Other frameworks involve giving these agents access to tools, functions, or APIs, allowing them to interact with tools programmatically. I am interested in exploring how LLMs can better utilize the vast number of APIs that currently exist. I believe this approach has latency, determinism, and simplicity advantages when compared to the browser automation approach.
My project uses the AppWorld environment and benchmark. AppWorld provides simulated APIs for real-world applications such as Venmo, Amazon, and Spotify, along with sets of tasks that require the agent to interact with these apps. Here is an example task: "I am going for a half-hour walk without internet. Play a playlist from my Spotify library that already has enough downloaded songs for it, so I do not have to repeat." Each task comes with unit tests that check whether it was completed successfully by verifying against a mock database. LLMs currently excel in areas such as text summarization and translation but struggle with more practical real-world tasks (sending Venmo payment requests, gathering data from Spotify, sending emails, etc.). By enabling an agent to retrieve relevant API documentation and make incremental progress toward a goal, it can perform complex workflows more accurately. In this project, I first develop a ReAct agent framework that allows an agent to generate code, view its output, and iteratively make progress toward a goal. I then work with the Llama-3.1-8B-Instruct model to see how different fine-tuning and prompting methods improve its performance.
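To make the interaction concrete, here is an illustrative sketch of the kind of code an agent might write for the playlist task above. Every endpoint and field name is a placeholder in AppWorld's `apis.<app>.<endpoint>` style, not the benchmark's actual Spotify API, which the agent has to look up through the documentation tool.

```python
def pick_offline_playlist(apis, access_token: str, walk_minutes: int = 30):
    """Illustrative agent code: find a library playlist with enough downloaded songs.

    `apis` stands for the object the AppWorld execution shell exposes to agent code;
    all endpoint and field names below are placeholders, not the real AppWorld API.
    """
    for playlist in apis.spotify.show_playlist_library(access_token=access_token):
        songs = apis.spotify.show_playlist_songs(
            access_token=access_token, playlist_id=playlist["id"]
        )
        downloaded = [s for s in songs if s["is_downloaded"]]
        # A half-hour walk needs at least walk_minutes of downloaded music.
        if sum(s["duration_seconds"] for s in downloaded) >= walk_minutes * 60:
            apis.spotify.play_playlist(access_token=access_token, playlist_id=playlist["id"])
            return playlist["name"]
    return None
```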
Experiments:
- Zero-Shot Llama-3.1-8B-Instruct
- One-Shot Llama-3.1-8B-Instruct
- One-Shot THUDM/agentlm-7b
- Llama-3.1-8B-Instruct (LoRA) Fine-Tuned on Llama-3.3-70B-Instruct Task Outputs (Distilled Outputs)
I evaluate these models in the AppWorld environment to see how well they handle multi-step reasoning and API usage. In addition to task success/failure, I record statistics such as the number of iterations per task, the average number of tokens per task, and the agent's perceived completion of the task.
Progress
Approach
ReAct Agent Framework
Tasks in the AppWorld environment often require multiple API calls, error handling, and mid-task corrections. To address this complexity, I employ a ReAct (Reasoning and Action) workflow: the model receives an initial prompt describing the available APIs and is instructed to respond with Python code. This code is executed in the AppWorld environment, and the resulting output is appended to the model's context before the next iteration. The loop continues until the model either completes the task or reaches a predefined maximum number of iterations. By letting the model observe and refine its own outputs, ReAct improves interpretability and reliability on multi-step tasks. On the public AppWorld leaderboard, the most successful framework varies by LLM; however, ReAct with GPT-4o is currently the most successful model-framework combination.
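A minimal sketch of this loop is shown below. Here, `generate_response` (the LLM call) and `execute_in_appworld` (code execution in the simulated environment) are hypothetical stand-ins for the actual model and AppWorld calls, not part of any library.

```python
import re

MAX_ITERATIONS = 10  # per-task iteration budget used at evaluation time


def extract_code(response: str) -> str:
    """Pull the first fenced Python block out of the model's reply, if present."""
    match = re.search(r"```python\n(.*?)```", response, flags=re.DOTALL)
    return match.group(1) if match else response


def react_loop(task_prompt: str, generate_response, execute_in_appworld) -> list[dict]:
    """One ReAct episode: the model proposes code, the environment runs it,
    and the execution output is fed back into the conversation for the next turn."""
    messages = [{"role": "user", "content": task_prompt}]
    history = []
    for _ in range(MAX_ITERATIONS):
        reply = generate_response(messages)                 # hypothetical LLM call
        code = extract_code(reply)
        observation, task_done = execute_in_appworld(code)  # hypothetical environment call
        history.append({"code": code, "observation": observation})
        messages.append({"role": "assistant", "content": reply})
        messages.append({"role": "user", "content": f"Execution output:\n{observation}"})
        if task_done:
            break
    return history
```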
Prompt Engineering
I test two prompting techniques on the Llama-3.1-8B-Instruct model to determine whether prompting alone can improve task performance: zero-shot and one-shot prompts. Both prompts give the LLM an overview of the AppWorld APIs available to the agent and of the tools it can call for more documentation. After experimentation, I had to add more specific guidance to mitigate errors such as the model failing to output Python code or hallucinating the output of its own code. I initially aimed to test a few-shot prompt as well, but the amount of tool-overview information required for the model to understand the AppWorld environment left no room in the context window for additional examples.
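The two prompts share the same skeleton; the one-shot variant appends a single worked trajectory. The sketch below shows only the structure, with the API overview and example text abridged as placeholders rather than the exact wording I used.

```python
API_OVERVIEW = "(condensed description of the AppWorld apps and the documentation tool)"
WORKED_EXAMPLE = "(one full task -> code -> execution-output trajectory; one-shot only)"


def build_prompt(task_instruction: str, one_shot: bool = False) -> str:
    """Assemble the task prompt; the wording here is illustrative, not verbatim."""
    parts = [
        "You are an agent that completes tasks by writing Python code.",
        "Respond only with Python code inside a fenced code block and never invent the code's output.",
        API_OVERVIEW,
    ]
    if one_shot:
        parts.append("Example interaction:\n" + WORKED_EXAMPLE)
    parts.append("Task: " + task_instruction)
    return "\n\n".join(parts)
```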
Fine-Tuning
To determine whether a smaller model can be brought closer to the performance of larger models, I apply LoRA (Low-Rank Adaptation) fine-tuning to the Llama-3.1-8B-Instruct model. To preserve the validity of the benchmark, AppWorld tries to limit the amount of information available online about its tasks, so there are no publicly available datasets of AppWorld task interactions. Further, because agent tool use is relatively novel, I was unable to find comparable datasets online to use for fine-tuning. I therefore created my own dataset by distilling outputs from the larger Llama-3.3-70B-Instruct model. From the successful task completions, I collect (task, iterations) pairs that demonstrate correct progression toward task completion; a sketch of how these trajectories become training examples is shown below. Fine-tuning on this data appears to reduce syntactic errors and improve the smaller model's ability to call APIs accurately. Importantly, all synthetic data is generated from the AppWorld training and dev task splits, avoiding any overlap that could cause data leakage from the distilled task executions.
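As an illustration, here is a minimal sketch of how the successful 70B trajectories can be flattened into chat-style training examples. The field names and JSONL format are assumptions chosen for illustration, not a fixed schema.

```python
import json


def trajectories_to_examples(trajectories: list[dict]) -> list[dict]:
    """Keep only successful trajectories and flatten them into chat-format rows.

    Each trajectory is assumed to look like:
      {"task": str, "success": bool,
       "iterations": [{"code": str, "observation": str}, ...]}
    """
    examples = []
    for traj in trajectories:
        if not traj["success"]:  # drop runs that failed the AppWorld unit tests
            continue
        messages = [{"role": "user", "content": traj["task"]}]
        for step in traj["iterations"]:
            messages.append({"role": "assistant", "content": step["code"]})
            messages.append({"role": "user", "content": step["observation"]})
        examples.append({"messages": messages})
    return examples


def save_jsonl(examples: list[dict], path: str) -> None:
    """Write one training example per line."""
    with open(path, "w") as f:
        for example in examples:
            f.write(json.dumps(example) + "\n")
```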
Baselines
For benchmarking, I evaluate the original Llama-3.1-8B-Instruct with zero-shot and one-shot prompting. I also benchmark the agentlm-7b model, which is comparable in size and was instruction-tuned for agent tasks. Although AppWorld provides a leaderboard of frontier models, those models contain many more parameters than Llama-3.1-8B-Instruct, so direct comparisons are not very informative. Nevertheless, the leaderboard results show how difficult complex task solving is for LLMs: even SOTA models achieve less than 50% success on AppWorld tasks.
Experiments
AppWorld Tasks
The AppWorld environment provides a benchmark for evaluating LLMs' ability to use APIs to complete complex workflows. It contains 750 tasks across three difficulty levels that require agents to interact with simulated applications through API calls. These tasks are comparable to real-world scenarios an agent might be assigned, such as sending emails, scheduling appointments, or transferring money. Each task has a specific goal and validation criteria that are checked by unit tests. Because many of the AppWorld applications mimic real apps like Amazon and Spotify, it is reasonable to assume that success on AppWorld tasks translates to success on real-world tasks. I select a subset of 15 tasks from the test_normal split for my experiments. I initially intended to evaluate on a larger set of tasks but ran into compute and memory limitations. The authors of the AppWorld paper note that they spent approximately $10K on their experiments!
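For reference, here is a minimal sketch of loading and exercising tasks with the appworld Python package. The exact names (`load_task_ids`, `AppWorld`, `world.execute`) follow my reading of the package documentation and may differ across versions.

```python
from appworld import AppWorld, load_task_ids  # names per the appworld package docs; may vary by version

# The 15-task subset used in my experiments comes from the test_normal split.
task_ids = load_task_ids("test_normal")[:15]

for task_id in task_ids:
    with AppWorld(task_id=task_id, experiment_name="react_llama_8b") as world:
        print(world.task.instruction)  # natural-language goal given to the agent
        # Agent-generated code runs inside the simulated environment, e.g. listing the apps:
        output = world.execute("print(apis.api_docs.show_app_descriptions())")
        print(output)
```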
Synthetic Data Generation
Since AppWorld provides an evaluation environment rather than a training dataset, I developed my own synthetic dataset for fine-tuning. To do this, I used the Together API with the Llama-3.3-70B-Instruct model, hypothesizing that the larger Llama model would generate higher-quality examples for teaching the smaller one. It is worth noting that Llama-3.3-70B with a similar ReAct framework achieves only a 20.8% completion rate on the AppWorld test_normal benchmark, so I removed unsuccessful task trajectories from the synthetic dataset. To generate the outputs, I use the same ReAct setup as for evaluation, with the hyperparameters shown below. Note that I allow 20 iterations per task during data generation rather than the 10 used at evaluation, in order to collect more tokens for the dataset.
The generation process works as follows (a sketch of this loop follows the hyperparameter list below):
- For each AppWorld training and dev task, I prompt the Llama-3.3-70B with the task description
- The model generates Python code to solve the task
- The code is executed in the AppWorld environment
- The execution output is captured and appended to the model's context
- The process continues for multiple iterations until the task is completed or a maximum number of iterations is reached
- If the task is completed successfully, I append the full trajectory to my dataset
Hyperparameters:
- 20 iterations per task
- 512 tokens per iteration
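Below is a sketch of this teacher-generation loop using Together's OpenAI-style Python SDK. The exact model identifier string and the `execute_in_appworld` helper are assumptions, and error handling is omitted.

```python
from together import Together  # Together's Python SDK (OpenAI-style chat interface)

client = Together()  # assumes TOGETHER_API_KEY is set in the environment
MODEL = "meta-llama/Llama-3.3-70B-Instruct-Turbo"  # exact model string on Together may differ
MAX_ITERATIONS = 20   # more generous than the 10 iterations used at evaluation time
MAX_NEW_TOKENS = 512  # tokens generated per iteration


def generate_trajectory(task_prompt: str, execute_in_appworld) -> dict:
    """Run the 70B teacher through the same ReAct loop and record every step."""
    messages = [{"role": "user", "content": task_prompt}]
    steps, success = [], False
    for _ in range(MAX_ITERATIONS):
        reply = client.chat.completions.create(
            model=MODEL, messages=messages, max_tokens=MAX_NEW_TOKENS
        ).choices[0].message.content
        observation, success = execute_in_appworld(reply)  # hypothetical environment call
        steps.append({"code": reply, "observation": observation})
        messages.append({"role": "assistant", "content": reply})
        messages.append({"role": "user", "content": f"Execution output:\n{observation}"})
        if success:
            break
    return {"task": task_prompt, "success": success, "iterations": steps}
```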
Fine-Tuning on Synthetic Data
Using the synthetic data, I implemented Parameter-Efficient Fine-Tuning (PEFT) with Low-Rank Adaptation (LoRA) on the Llama-3.1-8B-Instruct model. I chose this combination due to compute limitations and the relatively large size of the model; a sketch of the setup follows the configuration list below.
Fine-tuning configuration:
- GPU: gpu_1x_gh200
- LoRA rank: 16
- LoRA alpha: 32
- Learning rate: 2e-4
- Training epochs: 3
- Batch size: 4
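A minimal sketch of the corresponding setup with the Hugging Face transformers and peft libraries is shown below. The target modules and dropout value are assumptions not listed in the configuration above, and the training loop itself (learning rate 2e-4, 3 epochs, batch size 4) is omitted.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # Hugging Face model id

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype=torch.bfloat16)

lora_config = LoraConfig(
    r=16,                  # LoRA rank
    lora_alpha=32,         # LoRA alpha
    lora_dropout=0.05,     # assumption; not part of my listed configuration
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the low-rank adapter weights are trainable
```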
Benchmark Evaluations
For baseline evaluations, I tested several models on the AppWorld tasks:
- Llama-3.1-8B-Instruct (zero-shot)
- Llama-3.1-8B-Instruct (one-shot)
- AgentLM-7B (one-shot)
The primary metric I use to evaluate model performance is task success rate, but I also track token usage and the number of iterations (see the aggregation sketch below). The zero-shot Llama-3.1-8B-Instruct baseline achieved a 0% success rate on the 15 tasks, highlighting how difficult these tasks are without task-specific fine-tuning. This is unsurprising given both the performance of much larger SOTA models and the absence of examples in the zero-shot prompt: AppWorld tasks require specialized knowledge of API interactions and complex reasoning that is not captured by general instruction tuning.
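For completeness, here is a small sketch of how per-task records are aggregated into the table below; the field names are my own bookkeeping, not an AppWorld schema.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    task_id: str
    passed_unit_tests: bool       # ground-truth success, judged by AppWorld's unit tests
    agent_claimed_complete: bool  # the agent's own judgement ("completion")
    iterations_used: int
    tokens_generated: int


def summarize(results: list[TaskResult]) -> dict:
    """Aggregate per-task records into the headline metrics reported below."""
    n = len(results)
    return {
        "success_rate": 100 * sum(r.passed_unit_tests for r in results) / n,
        "completion_rate": 100 * sum(r.agent_claimed_complete for r in results) / n,
        "avg_iterations": sum(r.iterations_used for r in results) / n,
        "avg_tokens_per_task": sum(r.tokens_generated for r in results) / n,
    }
```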
Fine-tuned Model Evaluations
The fine-tuned model (Llama-3.1-8B-Instruct with LoRA fine-tuning on the synthetic data) and the baseline models were all tested on the same set of 15 AppWorld tasks from the test_normal split. As shown and discussed below, the fine-tuned model improves significantly over the zero-shot and one-shot baselines as well as over the agentlm-7b model.
| Model | Success Rate | Completion Rate | Avg. Iterations (max 10) | Avg. Tokens per Task |
|---|---|---|---|---|
| Llama-3.1-8B-Instruct (zero-shot) | 0.00% | 6.67% | 10.00 | 4972.53 |
| Llama-3.1-8B-Instruct (one-shot) | 6.67% | 20.00% | 9.93 | 3397.40 |
| THUDM/agentlm-7b (one-shot) | 0.00% | 0.00% | 10.00 | 2151.73 |
| Llama-3.1-8B-Instruct Fine-Tuned (one-shot) | 20.00% | 40.00% | 9.53 | 4759.33 |
Success Rate
The success rate is defined as the percentage of task executions that passed all the unit tests for the given task. The fine-tuned Llama-3.1-8B-Instruct model reaches a success rate of 20.00%, a significant improvement over the zero-shot and one-shot baselines, which suggests the model successfully learned from the examples seen during fine-tuning. This success rate is also comparable to that of the Llama-3.3-70B-Instruct model on the full AppWorld benchmark, although the difficulty of my 15-task subset may not match the difficulty of the full benchmark. Nonetheless, the result shows promise for fine-tuning smaller models toward specific task goals.
Completion Rate
The completion rates of the models roughly follow the success rates. The completion rate is defined as the percentage of tasks where the model believed it had completed the task. For example, the fine-tuned Llama-3.1-8B-Instruct model has a completion rate of 40.00% against a success rate of 20.00%: of the tasks the model believed it had completed, only half were actually completed successfully.
Average Iterations per Task
The average number of iterations per task is roughly constant across all models, though the fine-tuned model uses slightly fewer iterations than the others. This makes sense, as the fine-tuned model likely produced fewer erroneous iterations that needed to be corrected.
Average Tokens per Task
The average number of tokens per task is highest for the zero-shot Llama-3.1-8B-Instruct model. I hypothesize that this is because the model has a less clear understanding of the task and therefore generates more tokens attempting to complete it. The agentlm-7b model has the lowest average token count per task, likely because of differences in model architecture.
Qualitative Observations
From a qualitative standpoint, a few clear error patterns appeared across all of the models, though they were less frequent in the fine-tuned model:
- Real-World API Usage: The model would often call the real-world API for the simulated application; for example, calling the actual Spotify API when it should have consulted the AppWorld Spotify API documentation. This error is hard to mitigate because the models have all been trained on real-world API documentation. It is also an error the models would likely not exhibit in real-world deployments, where they interact with the real APIs they were trained on.
- Syntax Errors: The models generally had a difficult time producing syntactically correct Python code. In the synthetic dataset, the Llama-3.3-70B-Instruct model exhibited this problem far less frequently, so I attribute it largely to the smaller size of the Llama-3.1-8B-Instruct and agentlm-7b models.
- Hallucinations: The model would often hallucinate the output of its generated code. For example, if the model's output included code calling the `get_playlist_tracks` function, it would often hallucinate and output its own list of tracks.
Future Work
There are many ways this project could be expanded upon in the future. I would like to evaluate the performance of the fine-tuned model on a larger subset of the AppWorld tasks to verify my results. Further, I would like to give the agents a larger number of iterations to complete the tasks to study how well they are able to correct their own errors. Both of these extensions rely on more compute resources. Finally, I would like to experiment with orchestration of multiple agents to see if they are able to correct each other's errors.
References
Trivedi, Harsh, et al. "AppWorld: A controllable world of apps and people for benchmarking interactive coding agents." arXiv preprint arXiv:2407.18901 (2024).
(The content is based on Stanford CS224N’s Custom Final Project.)