Lakera / Check Point Software
"Crash test" for LLMs in AI agents
Lakera and the UK AI Security Institute have launched 'b3', a new open source security benchmark designed specifically to protect the large language models (LLMs) in AI agents.
b3 is built on a new idea called Threat Snapshots. Instead of simulating a complete AI agent from start to finish, threat snapshots zoom in on the critical points where vulnerabilities in LLMs frequently occur.
By testing the models at these specific points, developers can see how robust their systems are against attacks, without the complexity that was previously required to model a complete agent workflow. It is a kind of 'crash test' for AI agents.
"We developed the b3 benchmark because today's AI agents are only as secure as the LLMs that fuel them," explains Lakera co-founder Mateo Rojas-Carulla. "These threat snapshots allow us to systematically look for vulnerabilities on the attack surface that were previously hidden in the complex agent workflows."
b3 combines ten representative threat snapshots with 19,433 real cyberattacks collected from the red-teaming game 'Gandalf: Agent Breaker'. The attack types evaluated include prompt exfiltration, phishing link injection, malicious code injection, denial of service (DoS) and unauthorized tool calls.
The first tests with 31 widely used LLMs show:
- Better reasoning capabilities increase security
- Model size does not correlate with security performance
- Closed-source models perform better on average, but the top open models are catching up
The benchmark report is available under an open source license: https://arxiv.org/pdf/2510.22620
Gandalf: Agent Breaker is a hacking simulator game in which you are challenged to crack and exploit AI agents in realistic scenarios. The ten GenAI applications in the game simulate the behavior of a real AI agent. Each app features multiple difficulty levels, layered defenses and novel attack surfaces that challenge a range of skills, from prompt engineering to red teaming. Some of the apps are chat-based, while others rely on code-level thinking, file processing, memory or the use of external tools.











