Most AI agents reread transcripts instead of learning principles, so they repeat mistakes and fail to transfer lessons to new situations. ALTK‑Evolve converts raw agent trajectories into reusable guidelines. In our benchmarks, this approach improved reliability without bloating the context, especially on hard multi-step tasks (Δ +14.2 points on AppWorld).
The “Eternal Intern” Problem
Imagine a great cook who knows every cookbook by heart but forgets everything about the kitchen every morning. They don’t remember that your oven runs hot or that your patrons like extra salt. They follow the recipe card, but when they run out of lemons, they freeze. This is most AI agents: good at following prompts, bad at accumulating knowledge about their environment. If you paste yesterday’s log back into the prompt, the agent just rereads the history; it doesn’t generalize from it.
A junior cook needs separate recipes for vinaigrette and duck à l’orange; a chef learns that “acid balances fat” and applies it everywhere. Likewise, a reliable agent must extract principles from experience and apply them to new tasks, not just near-replicas of old ones. ALTK‑Evolve’s long-term memory subsystem does exactly that: it converts interaction traces into candidate guidelines, filters them for quality, and injects only the guidance relevant at the moment of action. Agents need principles, not records.
A recent MIT study found that 95% of AI pilots fail, in large part because agents don’t adapt and learn in the field. ALTK-Evolve addresses this learning gap with long-term episodic memory, letting agents reason better over time.
The Solution: Long-Term Memory with ALTK-Evolve
Evolve is a memory system for AI agents that helps them improve over time by generating guidelines from previous runs and applying them to new ones.
Operationally, the system runs as a continuous loop.
Downward flow (observation and extraction): capture the complete agent trajectory (user utterances, thoughts, tool calls, and outcomes) at the interaction layer (e.g., Langfuse or another OpenTelemetry-based observability tool). Pluggable extractors mine the traces for structural patterns and persist them as candidate entities.

Upward flow (refinement and retrieval): background consolidation and scoring jobs merge duplicates, prune weak rules, and strengthen proven strategies, evolving a high-quality library of entities such as guidelines, policies, and SOPs. Retrieval surfaces only the relevant items through the interaction layer and returns them to the application-layer context.
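To make the loop concrete, here is a minimal, self-contained sketch in Python. Everything in it (Guideline, MemoryStore, the word-overlap retriever) is an illustrative stand-in under our own naming, not the actual ALTK‑Evolve API:

```python
# A schematic of the two flows above. All names here are illustrative.
from dataclasses import dataclass


@dataclass
class Guideline:
    text: str
    score: float = 1.0  # strengthened each time the rule proves useful


def _overlap(a: str, b: str) -> float:
    # Naive word-overlap stand-in for a real semantic retriever.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)


class MemoryStore:
    def __init__(self) -> None:
        self.guidelines: list[Guideline] = []

    # Downward flow: mine a raw trajectory for candidate guidelines.
    def extract(self, trajectory: list[dict]) -> None:
        for step in trajectory:
            if step.get("outcome") == "error":
                self.guidelines.append(Guideline(
                    f"When calling {step.get('tool', '?')}: "
                    f"{step.get('lesson', 'investigate this failure')}"
                ))

    # Upward flow: merge duplicates and drop rules below a score floor.
    def consolidate(self, min_score: float = 1.0) -> None:
        merged: dict[str, Guideline] = {}
        for g in self.guidelines:
            if g.text in merged:
                merged[g.text].score += g.score  # duplicates reinforce
            else:
                merged[g.text] = g
        self.guidelines = [g for g in merged.values() if g.score >= min_score]

    # Retrieval: surface only the top-k items relevant to this task.
    def retrieve(self, task: str, k: int = 5) -> list[str]:
        ranked = sorted(self.guidelines,
                        key=lambda g: (_overlap(task, g.text), g.score),
                        reverse=True)
        return [g.text for g in ranked[:k]]
```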
This approach works for three main reasons.
Teaches judgment: transforms one-time events into portable strategies that transfer between tasks.
Controls noise: scoring keeps memory lean and useful, so it never grows into a junk drawer.
Progressive disclosure: retrieval is just-in-time, so memory doesn’t cram everything into the context.
The result: increased reliability, especially on difficult tasks.
We evaluated on the AppWorld benchmark, where agents complete realistic multi-step tasks through APIs (on average, 9.5 APIs across 1.8 apps per task), with hard cases requiring more complex control flow. A ReAct agent received the task instructions plus the top five guidelines retrieved from a previous run (train/dev) and was tested on an unseen partition (test-normal). We report Scenario Goal Completion (SGC), a strict consistency metric that requires success across task variants.
Difficulty   Baseline SGC   + Memory   Δ
Easy         79.0%          84.2%      +5.2
Medium       56.2%          62.5%      +6.3
Hard         19.1%          33.3%      +14.2
Overall      50.0%          58.9%      +8.9
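As a rough illustration of the setup, the evaluation prompt could be assembled like the hypothetical helper below, combining task instructions with the top five retrieved guidelines (the exact template used in the paper is not reproduced here):

```python
def build_prompt(task_instructions: str, guidelines: list[str]) -> str:
    # Prepend up to five retrieved guidelines as hints for the ReAct agent.
    hints = "\n".join(f"- {g}" for g in guidelines[:5])
    return (f"{task_instructions}\n\n"
            f"Guidelines learned from previous runs:\n{hints}")
```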
Key conclusions from the evaluation include:
Generalization: the agent improves on unseen test-normal tasks, indicating that it is learning principles rather than memorizing recipes.
Complexity scaling: the harder the task, the larger the benefit from learned guidelines; hard tasks saw a 74% relative increase in success rate. Guidelines help navigate complex control flows.
Consistency: the SGC improvement exceeded the raw pass-rate improvement, reducing “erratic” behavior across scenario variations. Guidelines don’t just help agents solve tasks; they help solve them consistently across variants.
For more details on the experiment, please see the paper at https://arxiv.org/abs/2603.10600.
Getting Started (Choose Your Path)
You can choose how ALTK‑Evolve is integrated into your agent.
No Code (Lite Mode) with Claude Code, Codex, and IBM Bob
Install the plugin in Claude Code:

/plugin marketplace add AgentToolkit/altk-evolve
/plugin install evolve-lite@evolve-marketplace
That’s it! The plugin extracts entities from the trajectory and saves them as files on the file system, and Claude Code’s hooks handle retrieval automatically.
Prefer watching to reading? Watch a short video tutorial of Evolve-Lite in Claude Code: Demo
For an example of learning with Claude Code in Lite mode, check out the walkthrough here.
Lite mode is easy to test, but it has limitations: for example, it doesn’t collect insights from the entire agent session, and it performs no entity consolidation or garbage collection. The low-code and pro-code paths below address these limitations.
One-step integrations are also available for Codex and IBM Bob. Give them a try!
Low Code with a ReAct Agent
Add a single altk_evolve.auto import, flip a flag, and traces flow to the Arize Phoenix UI; then sync the traces to generate improvement guidelines, all without changing your current stack. It works with popular LLM clients and agent frameworks (OpenAI, LiteLLM, Hugging Face agents, etc.), so you keep your existing setup and gain visibility with minimal effort.
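A minimal sketch of what this path could look like: the altk_evolve.auto module path comes from the description above, but the ALTK_EVOLVE_ENABLED flag name is an assumption (see the Low-Code Tracing docs for the real configuration):

```python
import os

os.environ["ALTK_EVOLVE_ENABLED"] = "1"  # hypothetical enable flag

import altk_evolve.auto  # noqa: F401  # instruments supported clients on import

from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Plan the next release checklist."}],
)
# The call above is traced to the Arize Phoenix UI unchanged; syncing those
# traces later generates improvement guidelines without touching this code.
```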
To see how easily this fits into your existing projects, explore our hands-on examples showcasing integrations with different frameworks, and see the Low-Code Tracing documentation for configuration and feature details.
Pro Code with CUGA
We integrated ALTK‑Evolve directly into CUGA via MCP to create a tight, low-overhead learning loop. The get_guidelines MCP tool is called before each run to surface task-specific steering and reduce trial and error. After execution, CUGA sends back a structured execution trace via save_trajectory so Evolve can learn from what actually happened and improve future guidance. The result is an agent that improves over time while staying transparent, configurable, and easy to deploy.
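A schematic of that loop is sketched below. The MCP tool names (get_guidelines, save_trajectory) come from the text; the session object (shaped like an MCP client session), the payload fields, and the cuga_execute stub are illustrative assumptions:

```python
async def run_with_memory(session, task: str) -> None:
    # Before the run: pull task-specific steering from Evolve over MCP.
    result = await session.call_tool("get_guidelines", {"task": task})
    guidelines = result.content  # injected into CUGA's planning context

    # Execute the task with the retrieved guidance (stubbed below).
    trajectory = await cuga_execute(task, guidelines)

    # After the run: send the structured trace back so Evolve can learn.
    await session.call_tool("save_trajectory", {"trajectory": trajectory})


async def cuga_execute(task: str, guidelines) -> list[dict]:
    # Stand-in for the real CUGA run; returns a structured execution trace.
    return [{"step": 1, "task": task, "guidelines_used": len(guidelines)}]
```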
Would you like a visual tour? Check out our CUGA integration tutorial: Video
Try It and Let Your Agents Learn
Agents shouldn’t wake up every morning as interns. This approach helps them learn on the job. If you use Claude Code, Codex, or IBM Bob, you can try it out in a few minutes and see how it improves your agents.
Starring the repository helps others discover the project and gives us direct guidance on what to build next.
Watch the demo
Claude Code Walkthrough (Video): Demo
OpenAI Codex Walkthrough (Video): Demo
IBM Bob Walkthrough (Video): Demo
CUGA Integration Walkthrough: Video

