
Papers

Hacking the Hacker: How AI Agents are Changing the Game of Penetration Testing

https://www.nctatechnicalpapers.com/Paper/2024/AI06_Haefner_6481_paper

Notes

They evaluated different models: Llama3, Dolphin-2.9, and GPT4o, and found GPT4o to perform best. They also tried different agent architectures:

  • Two agents
  • Central Coordinator Model
  • Team Lead Model

They found that two agents work better than many specialized agents. But I wonder if this is a side effect of the prompting they were doing or of the model (are future models going to work better with more specialized agents?). They didn't test a single agent (which is what BoxPwnr is doing now). I wonder if two agents would work better than just one; I expect so, but I would also expect many specialized agents to work better than just two.
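
To make the comparison concrete, here is a minimal sketch of what a two-agent split could look like next to a single-agent loop, assuming an OpenAI-style chat API. The role prompts, model name, and execute_command helper are my own illustration, not taken from the paper:

```python
# Hypothetical two-agent loop: a "planner" proposes the next shell command and a
# "reviewer" approves it or replaces it before execution. Prompts, model name,
# and execute_command() are illustrative; the paper does not publish code.
import subprocess
from openai import OpenAI

client = OpenAI()

def ask(system: str, user: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
    )
    return resp.choices[0].message.content

def execute_command(cmd: str) -> str:
    # Run the command in a (hopefully sandboxed) shell and capture its output.
    out = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=120)
    return out.stdout + out.stderr

history = "Target: 10.10.10.10 (HTB Starting Point). Goal: find the flag."
for _ in range(12):  # the paper used 12 (and later 24) turns
    command = ask("You are the planner. Reply with exactly one shell command.", history)
    review = ask("You are the reviewer. Reply APPROVE, or reply with a single better shell command.",
                 f"History:\n{history}\nProposed command:\n{command}")
    if review.strip() != "APPROVE":
        command = review.strip()
    output = execute_command(command)
    history += f"\n$ {command}\n{output}"
```

A single-agent setup would just drop the reviewer call and let the planner see the raw output directly, which is roughly what BoxPwnr does today.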

Based on the output they shared, they used at least 2 machines from Starting Point: Meow & Fawn. But they don't disclose the full list or the per-machine statistics, only the following:

The authors ran the LLMs against ten Hack-The-Box challenges and the models were able to complete five of them.

Their success rate with GPT4o on Meow was 43%.

Very interesting observation:

This pattern and the success rates held true even when the number of turns increased from 12 to 24; the agents just don't seem to be able to "step back" and determine the problem. I have also observed this lack of stepping back. Hopefully we can address it with reasoning models.

I haven't run into hallucinations yet (which surprises me). I bet it's because we are using a newer release of the same model, or maybe there's something magic in the prompt. I also hit guardrails at the beginning, but after adding to the prompt that I was authorized to do the test, I haven't hit them again.
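
For reference, the guardrail workaround is just an explicit authorization statement up front in the system prompt. The wording below is illustrative, not BoxPwnr's actual prompt:

```python
# Illustrative only: the real BoxPwnr prompt differs, but the idea is simply to
# state that the engagement is authorized before asking for commands.
SYSTEM_PROMPT = (
    "You are assisting with an authorized security assessment of a Hack The Box "
    "machine that I have explicit permission to test. "
    "Suggest one concrete command per step and explain what it does."
)
```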

Overall, it's an inspiring paper, but I would have liked more detailed statistics, more information about the iterations they went through to reach the prompt they are using, and, of course, published source code.

PentestGPT: An LLM-empowered Automatic Penetration Testing Tool

https://arxiv.org/abs/2308.06782 (submitted 13 Aug 2023)

Recent papers / work on AI and hacking by Timothee Chauvin

This is an amazing list and much better than this wiki article. 1000% recommended.

https://tchauvin.com/recent-papers-ai-hacking

Hacking CTFs with plain agents

https://arxiv.org/abs/2412.02776

LLM Agents can Autonomously Hack Websites

https://arxiv.org/abs/2402.06664

An Empirical Evaluation of LLMs for Solving Offensive Security Challenges

https://arxiv.org/abs/2402.11814

LLM Agents can Autonomously Exploit One-day Vulnerabilities

https://arxiv.org/abs/2404.08144

Teams of LLM Agents can Exploit Zero-Day Vulnerabilities

https://arxiv.org/abs/2406.01637

NYU CTF Bench: A Scalable Open-Source Benchmark Dataset for Evaluating LLMs in Offensive Security

https://arxiv.org/abs/2406.05590

Interactive Tools Substantially Assist LM Agents in Finding Security Vulnerabilities

https://arxiv.org/abs/2409.16165

EnIGMA: Enhanced Interactive Generative Model Agent for CTF Challenges

https://enigma-agent.com/assets/paper.pdf

HackSynth: LLM Agent and Evaluation Framework for Autonomous Penetration Testing

https://arxiv.org/abs/2412.01778

Notes

They present a two-model design: a planner and a summarizer. The summarizer runs the commands and summarizes their output, and the summary is injected into the prompt on each call.
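
Here is my rough sketch of how such a planner/summarizer loop could work; the prompts, model name, and run_in_sandbox placeholder are mine, and the real implementation is in their public repo linked below:

```python
# Sketch of a planner/summarizer loop as I understand it from the paper.
# Prompts, model names, and run_in_sandbox() are placeholders; see the
# HackSynth repo for the actual code.
from openai import OpenAI

client = OpenAI()

def llm(system: str, user: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
    )
    return resp.choices[0].message.content

def run_in_sandbox(cmd: str) -> str:
    raise NotImplementedError  # execute inside an isolated container

summary = "Nothing done yet."
goal = "Obtain the user flag on the target machine."
for _ in range(20):
    # The planner only ever sees the rolling summary, not the full transcript,
    # which keeps the context small...
    command = llm(f"You are a penetration tester. Current knowledge:\n{summary}",
                  f"Goal: {goal}\nReply with the next shell command only.")
    output = run_in_sandbox(command)
    # ...but because the summary is rewritten into the prompt every turn, the
    # prompt prefix changes and provider-side prompt caching can't kick in.
    summary = llm("Summarize the findings so far, concisely.",
                  f"Previous summary:\n{summary}\n\nRan: {command}\nOutput:\n{output}")
```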

One issue with this approach is that you can't cache the prompt/conversation, since the system prompt is rewritten every turn with the updated summary. But it significantly reduces the context used, so it probably helps overall.

All the code is public :) https://github.com/aielte-research/HackSynth/tree/main

They also experiment with different temperatures, top-p values, and observation window sizes.
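
As an illustration of that kind of sweep (the values below are made up, and I'm assuming "observation window" means how much command output is fed back to the model):

```python
# Made-up grid: temperature and top_p are standard chat-completion parameters,
# while the observation window is assumed to be a cap on how many characters of
# command output are kept per step.
import itertools

temperatures = [0.0, 0.5, 1.0]
top_ps = [0.5, 0.9, 1.0]
observation_windows = [1024, 4096]  # characters of output kept per command

for temp, top_p, window in itertools.product(temperatures, top_ps, observation_windows):
    # A run_benchmark() helper (placeholder) would launch the agent against the
    # benchmark machines with these settings and record the solve rate.
    print(f"temperature={temp}, top_p={top_p}, observation_window={window}")
```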

The latest model they tested was GPT4o, so I wonder how this applies to newer models with reasoning and longer context.

Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models

https://arxiv.org/abs/2408.08926

InterCode: Standardizing and Benchmarking Interactive Coding with Execution Feedback

https://arxiv.org/pdf/2306.14898