Anamnesis: When AI Agents Learned to Chain glibc Exit Handlers to Bypass Shadow Stacks
Hook
GPT-5.2 just autonomously defeated Intel’s CET Shadow Stack protection by chaining glibc exit handlers—a technique that requires understanding dozens of interconnected glibc internals, pointer mangling schemes, and control flow restrictions. No human gave it hints about FSOP attacks or link_map traversal. It figured it out through 50 million tokens of trial and error.
Context
Exploit development has always been the domain of elite security researchers who spend years mastering arcane details of memory layouts, calling conventions, and mitigation bypasses. Writing an exploit that defeats modern protections like CFI, RELRO, and Shadow Stack requires chaining multiple techniques, understanding subtle interactions between libc internals, and iterating through dozens of failed attempts. It’s painstaking work that seemed fundamentally resistant to automation—symbolic execution systems struggle with the creative leaps required, and traditional fuzzing can’t reason about multi-stage exploitation strategies.
Anamnesis is an evaluation framework from security researcher Sean Heelan that answers a question the security community has been nervously asking: can frontier LLMs autonomously generate working exploits that bypass state-of-the-art mitigations? Not just crash a program or trigger a vulnerability, but chain primitives into full exploits that defeat ASLR, PIE, CFI, and Intel CET Shadow Stack. The framework provides a controlled environment with vulnerable QuickJS binaries compiled under progressively harder mitigation configurations, then turns LLM agents loose with debugging tools and compilation capabilities to see if they can independently discover exploitation techniques. The results are both technically fascinating and sobering from a security standpoint: Claude Opus 4.5 and GPT-5.2 successfully generated working exploits for all configurations, including one that defeats Full RELRO, CFI, and Shadow Stack simultaneously.
Technical Insight
Anamnesis’s architecture is built around reproducibility and isolation. The framework runs entirely in Docker containers, providing agents with a standardized Ubuntu environment containing vulnerable QuickJS binaries compiled with different mitigation combinations. Each agent receives a vulnerability report describing a type confusion bug in QuickJS’s array handling, a proof-of-concept trigger that crashes the program, and access to a set of tools: a Python script to compile and test exploits, GDB for binary analysis and debugging, and standard Unix utilities for inspection.
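The write-up doesn’t show test_exploit.py itself, but the compile-run-classify loop it describes can be sketched in a few lines of Python. Everything here is hypothetical: the success marker string and the classification labels are assumptions, not the real script’s interface.

```python
import subprocess

def run_exploit(binary, exploit_path, timeout=300):
    """Run the target on an exploit script and classify the outcome.

    Hypothetical sketch of a harness like test_exploit.py; the
    "SHELL_SPAWNED" marker and the result strings are illustrative
    assumptions, not the framework's actual protocol.
    """
    try:
        proc = subprocess.run([binary, exploit_path], capture_output=True,
                              text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return "timeout"
    if proc.returncode < 0:
        # A negative return code means the process died on a signal
        # (e.g. SIGSEGV -> signal 11), i.e. the exploit crashed the target.
        return f"crashed (signal {-proc.returncode})"
    if "SHELL_SPAWNED" in proc.stdout:
        return "success"
    return "exited cleanly, no shell"
```

The agent only ever sees this kind of coarse pass/fail feedback from the harness; everything it learns about why a CFI check or Shadow Stack violation fired has to come from its own GDB sessions.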
The agent interaction model is what makes this interesting. Agents don’t receive pre-written exploitation primitives or templates. They start with just the vulnerability description and must figure out everything else: how to turn the type confusion into arbitrary read/write, how to leak addresses to defeat ASLR, which data structures to corrupt to gain control flow, and how to chain techniques when direct approaches are blocked. The framework integrates with Claude’s Computer Use SDK and OpenAI’s agent SDK, providing a simple interface where agents can execute shell commands and receive output. Here’s a simplified example of how an agent might interact with the framework:
// Agent receives vulnerability context
const vulnReport = {
  type: "Type Confusion",
  location: "quickjs.c array handling",
  primitive: "Can confuse JSValue types, treating object as array",
  poc: "poc.js triggers crash via type confusion"
};

// Agent uses GDB tool to analyze memory layout
await agent.executeCommand("gdb -batch -ex 'run poc.js' ./quickjs");
// Observes crash, identifies corrupted pointer

// Agent writes exploit iteratively
await agent.writeFile("exploit.js", `
  // Craft fake JSObject to gain arbitrary read
  let fake_obj = /* ... agent's exploitation logic ... */;
  // Leak libc address by reading GOT entry
  let leaked_addr = read64(got_entry);
`);

// Agent compiles and tests
const result = await agent.executeCommand("python3 test_exploit.py exploit.js");
// Receives feedback: "ASLR bypassed, but crashed on CFI check"

// Agent adapts strategy based on failure
await agent.executeCommand("objdump -d quickjs | grep '__cfi_check'");
// Finds CFI instrumentation on indirect calls, pivots to FSOP approach
The really remarkable technical achievement documented in Anamnesis is how agents handled the hardest challenge: Full RELRO + CFI + Shadow Stack. With RELRO, the GOT is read-only, blocking traditional GOT overwrites. CFI restricts indirect calls to valid targets with matching signatures. Shadow Stack prevents ROP by maintaining a hardware-protected return address copy. This combination seems insurmountable—you can’t hijack control flow the traditional ways.
GPT-5.2’s solution, discovered autonomously across multiple runs, was to exploit the glibc exit handler chain. When a program exits normally, glibc calls functions registered with __cxa_atexit and functions in the __exit_funcs list. By corrupting these data structures through the type confusion primitive, the agent could cause the program to call arbitrary functions with controlled arguments—but only functions with signatures matching the exit handler prototype. The agent then chained multiple exit handlers: first calling setcontext (valid CFI target, signature matches) to pivot registers to controlled memory, then invoking syscalls through __libc_dlopen_mode to disable the sandbox. This doesn’t use ROP (blocked by Shadow Stack) or indirect calls to arbitrary addresses (blocked by CFI)—it’s pure data corruption leading to a sequence of legitimate function calls that together achieve code execution.
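For readers unfamiliar with the target structures: glibc keeps registered exit handlers in an __exit_funcs list of exit_function records (defined in glibc’s stdlib/exit.h). A ctypes model of the ef_cxa variant, the one __cxa_atexit populates, shows why corrupting a single record yields a function call with a controlled argument. Field names follow the glibc source; offsets assume a 64-bit build.

```python
import ctypes

class CxaHandler(ctypes.Structure):
    # The 'cxa' arm of glibc's exit_function union: called as fn(arg, status)
    _fields_ = [("fn", ctypes.c_void_p),          # mangled function pointer
                ("arg", ctypes.c_void_p),         # first argument passed to fn
                ("dso_handle", ctypes.c_void_p)]

class ExitFunction(ctypes.Structure):
    # flavor == ef_cxa selects the cxa arm of the union above
    _fields_ = [("flavor", ctypes.c_long),
                ("cxa", CxaHandler)]

# One 32-byte record on x86-64: overwrite fn (correctly remangled) and arg,
# and a normal exit() calls an attacker-chosen function with an
# attacker-chosen first argument -- no ROP, no forged return address.
```

Each exit_function_list block holds up to 32 such records plus a next pointer and an index, which is why the agent could plant fake records and simply let glibc walk the list during a normal exit.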
The framework’s logging is exceptionally detailed. Every agent interaction is recorded with full reasoning traces showing how the agent analyzed failures, consulted documentation, and refined its approach. One trace shows an agent spending 12 million tokens working through pointer mangling: glibc XORs function pointers with a per-process secret and rotates them before storing them, so the agent had to figure out how to leak the mangling guard from thread-local storage, then correctly demangle and remangle pointers to corrupt the exit handler chain without crashing. No human intervention occurred during these runs.
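The scheme the agent had to reverse is glibc’s PTR_MANGLE/PTR_DEMANGLE pair: on x86-64, a stored pointer is XORed with a per-process guard held in thread-local storage (at %fs:0x30) and rotated left by 17 bits. A Python model of the round trip, and of the guard-recovery step the trace describes:

```python
MASK = (1 << 64) - 1

def rol64(x, n):
    return ((x << n) | (x >> (64 - n))) & MASK

def ror64(x, n):
    return ((x >> n) | (x << (64 - n))) & MASK

def mangle(ptr, guard):
    # PTR_MANGLE on x86-64: XOR with the pointer guard, then rotate left 17
    return rol64(ptr ^ guard, 17)

def demangle(stored, guard):
    # PTR_DEMANGLE inverts it: rotate right 17, then XOR with the guard
    return ror64(stored, 17) ^ guard

def recover_guard(stored, known_ptr):
    # Given one leaked mangled pointer whose plaintext value is known
    # (e.g. a default libc handler), the guard falls out directly.
    return ror64(stored, 17) ^ known_ptr
```

This is why the agent needed both a leak of a mangled pointer and knowledge of its plaintext value: with those two, recovering the guard, and therefore forging new mangled pointers, is a pair of arithmetic operations.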
The experimental setup tests multiple mitigation configurations: baseline (ASLR only), NX + PIE, Full RELRO, CFI, Shadow Stack, and combined configurations. Each configuration requires different exploitation strategies. The framework runs 10 independent attempts per configuration per model, recording success rates, token consumption, and time to completion. For the hardest challenge (Full RELRO + CFI + Shadow Stack + Sandbox), GPT-5.2 achieved a 60% success rate across 10 runs, with successful exploits consuming 30-60 million tokens and taking 3-5 hours of wall-clock time.
Gotcha
The elephant in the room: 10 runs per experiment is nowhere near enough for statistical significance. Exploit generation involves substantial randomness in agent reasoning paths, and with success rates between 40% and 70% on the hard challenges, you’d need hundreds of runs to confidently distinguish model capabilities. The provided results are suggestive and technically impressive, but they are not definitive proof of consistent capability. Replicating these experiments could produce very different success rates simply because agent exploration is stochastic.
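To put numbers on that caveat: a Wilson score interval (a standard binomial confidence interval, computed here rather than taken from the write-up) for 6 successes in 10 runs spans roughly 31% to 83%.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score confidence interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

lo, hi = wilson_interval(6, 10)  # roughly (0.31, 0.83)
```

An interval that wide is consistent with anything from coin-flip reliability to near-consistent exploitation, which is exactly why distinguishing model capabilities at these sample sizes is hard.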
Anamnesis is also tightly coupled to the specific QuickJS vulnerability and mitigation configurations Heelan selected. This isn’t a general-purpose exploit generation framework you can point at arbitrary binaries. The vulnerability is a type confusion bug with specific characteristics (arbitrary read/write primitive, ability to craft fake objects), and exploitation strategies that work here might not transfer to use-after-free bugs, heap overflows, or race conditions. The framework provides excellent experimental controls for studying LLM capabilities on this particular challenge, but extrapolating to general exploit generation requires caution. The Docker environment, tool integration, and evaluation scripts are all purpose-built for this scenario. Adapting Anamnesis to study different vulnerability classes would require substantial engineering work—you’d need new vulnerable binaries, different tooling setups, and modified agent prompts. It’s a research artifact demonstrating what’s possible, not a plug-and-play system for evaluating LLMs on arbitrary exploitation tasks.
Verdict
Use Anamnesis if you’re researching offensive AI capabilities and need reproducible experiments showing current frontier model performance on complex exploitation tasks, if you’re in AI safety and want concrete examples of dual-use capabilities that have crossed meaningful thresholds, or if you’re studying agent reasoning on deeply technical problems where success requires chaining dozens of insights. The detailed logs showing 50M token reasoning traces are invaluable for understanding how LLMs approach multi-stage technical challenges. Skip if you need practical tooling for day-to-day security work—this is an evaluation framework, not an operational exploit generator. Also skip if you want general-purpose automated exploitation across diverse vulnerability types; Anamnesis is intentionally focused on one carefully selected challenge. The primary value is benchmarking and understanding, not production security testing.