Hermes Agent: the self-improving agent that writes its own playbooks
TL;DR: Hermes Agent by Nous Research (MIT, February 2026) is a persistent agent runtime that creates reusable skills from experience, stores them, and loads them in future sessions. At 26,300 GitHub stars, it’s the fastest-growing open-source agent framework of 2026, and v0.7.0 added pluggable memory backends. The catch: skill quality depends entirely on the base model’s reasoning ability, and the 2,200-character built-in memory limit forces lossy compression after about ten sessions.

Most AI agents have amnesia. You spend twenty minutes teaching one how to deploy your staging environment, close the session, and the next day it asks you the same questions. Every session starts from zero. The agent is capable, but it never learns.
Hermes Agent, released by Nous Research in February 2026, bets on a different architecture. Instead of forgetting everything between sessions, Hermes watches what it does, identifies patterns worth preserving, and writes them down as reusable skills. The next time a similar task appears, it loads the skill instead of reasoning from scratch.
That sounds like magic. It isn’t. The skill loop has real constraints that most of the 26,300 people who starred the GitHub repo haven’t hit yet. I want to explain how it actually works, and where it stops working.
What problem does Hermes actually solve?
Stateless agents treat every conversation as a first date. They have no memory of what worked, what failed, or who you are. RAG-augmented agents retrieve documents but don’t learn from their own execution. Fine-tuned models encode knowledge into weights but can’t adapt to individual users at runtime.
Hermes sits in a different category. Nous Research — founded in 2023 by Jeffrey Quesnelle, Karan Malhotra, Teknium, and Shivani Mitra — built it as a persistent agent runtime: a long-running process that maintains state, creates skills, and models the user across sessions. The agent frameworks landscape has options like LangChain and CrewAI for orchestration, but none of them ship with a built-in learning loop.
The framework is Python-based, supports six terminal backends (local, Docker, SSH, Daytona, Singularity, Modal), connects to 14+ messaging platforms including Telegram, Discord, Slack, and WhatsApp, and integrates with MCP servers for tool extensibility. It reached 22,000 GitHub stars within weeks of its February 2026 launch and hit 26,300 by early April, with over 240 contributors. It’s MIT licensed and ships no telemetry.
How does the skill creation loop work?
The skill loop is the core architectural differentiator. Here’s the actual mechanism:
```mermaid
flowchart TD
    A[Agent executes task] --> B{Task completed successfully?}
    B -->|No| A
    B -->|Yes| C[Evaluate: recurring pattern?]
    C -->|No| D[Log to episodic memory]
    C -->|Yes| E[Codify steps into named skill]
    E --> F[Store skill as structured file]
    F --> G[Index skill by topic + trigger conditions]
    G --> H[Future task arrives]
    H --> I{Matching skill exists?}
    I -->|Yes| J[Load skill, execute with current context]
    I -->|No| A
    J --> K{Skill performed well?}
    K -->|Yes| L[Increment confidence score]
    K -->|No| M[Revise skill via reflection]
    M --> F
```
After roughly every 15 similar tasks (per Geeky Gadgets, March 2026), Hermes triggers a reflection cycle. The agent examines what it did, extracts the reusable pattern, and writes it as a structured skill file. Skills have names, trigger conditions, step sequences, and confidence scores. The skill_manage tool handles creation, loading, editing, and deletion.
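Nous Research hasn’t published the exact skill schema, so the record below is a hypothetical sketch assembled from the fields described above (name, trigger conditions, step sequence, confidence score), not the actual file format:

```python
from dataclasses import dataclass

@dataclass
class Skill:
    """Hypothetical shape of a skill record; the real schema isn't published."""
    name: str                 # e.g. "deploy_staging"
    triggers: list[str]       # conditions that cause the skill to load
    steps: list[str]          # the codified step sequence
    confidence: float = 0.5   # raised on success, revised on failure

deploy_staging = Skill(
    name="deploy_staging",
    triggers=["deploy", "staging"],
    steps=[
        "Run the test suite; abort on failure",
        "Build the image tagged with the git SHA",
        "Apply the Helm chart with staging values",
        "Verify the health endpoint returns 200",
    ],
)
```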
This is not reinforcement learning. The model’s weights don’t change. What changes is the agent’s prompt: skills get injected into context when their trigger conditions match the current task. Think of it as the agent writing its own system prompts for specialized workflows.
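To make “skills get injected into context” concrete, here’s a minimal sketch of that retrieval step, reusing the hypothetical Skill record above. The substring matcher is my simplification; Hermes indexes skills by topic and trigger conditions, and its actual matching is more sophisticated:

```python
def select_skills(task: str, skills: list[Skill], limit: int = 3) -> list[Skill]:
    """Naive trigger matching: load skills whose triggers appear in the task."""
    hits = [s for s in skills if any(t in task.lower() for t in s.triggers)]
    return sorted(hits, key=lambda s: s.confidence, reverse=True)[:limit]

def build_prompt(task: str, skills: list[Skill]) -> str:
    """Inject matched skills into the prompt; the model's weights never change."""
    blocks = []
    for s in select_skills(task, skills):
        steps = "\n".join(f"{i}. {step}" for i, step in enumerate(s.steps, 1))
        blocks.append(f"## Skill: {s.name}\n{steps}")
    blocks.append(f"## Task\n{task}")
    return "\n\n".join(blocks)
```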
The sous chef analogy works here. A sous chef who successfully plates a complex dish writes down the recipe for next time. The recipe doesn’t make the chef more skilled. It makes the chef faster at a task they already know how to do. If the chef learned a bad technique, the recipe preserves the bad technique.
What does the memory architecture look like?
Hermes v0.7.0 (released April 3, 2026) introduced pluggable memory backends, which addresses the most common complaint about earlier versions. The architecture has three layers that map to the hierarchical memory model we covered previously:
| Layer | Purpose | Storage | Limit |
|---|---|---|---|
| Working memory | Current session context | In-context window | Model’s context length |
| Episodic memory | Past conversations, task logs | MEMORY.md (built-in) or external backend | ~2,200 chars built-in |
| Semantic memory | User model, preferences | Honcho dialectic profiling | Backend-dependent |
The built-in memory uses a MEMORY.md file, a flat text file capped at roughly 2,200 characters. After about ten sessions of moderate use, the agent must compress older memories to make room for new ones. This compression is lossy. Details get flattened. Nuance disappears.
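To see why, here’s a toy sketch of the summarize-to-fit pass a capped flat file forces. The budget figure comes from the docs; the compression strategy is an assumption, not Hermes’s actual code:

```python
from typing import Callable

MEMORY_BUDGET = 2_200  # approximate built-in cap, in characters

def compress_memory(memory: str, summarize: Callable[[str], str]) -> str:
    """Summarize the oldest half of MEMORY.md when it overflows the budget.

    `summarize` stands in for an LLM call that returns a short digest. The
    detail it drops is gone for good: "debugged replication lag on replica-2
    by raising wal_keep_size" flattens to "user works with PostgreSQL".
    """
    if len(memory) <= MEMORY_BUDGET:
        return memory
    lines = memory.splitlines()
    cut = len(lines) // 2
    return "\n".join([summarize("\n".join(lines[:cut])), *lines[cut:]])
```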
v0.7.0’s fix: seven pluggable memory providers — Honcho, OpenViking, Mem0, Hindsight, Holographic, RetainDB, and ByteRover. Swap out the built-in store for a dedicated memory backend and the capacity constraint vanishes. But the default experience, the one most new users encounter, still hits that wall.
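The provider interface isn’t documented in what I’ve read, but a pluggable backend abstraction generally reduces to a small contract like this. An illustrative Protocol, not Hermes’s actual API:

```python
from typing import Protocol

class MemoryBackend(Protocol):
    """Illustrative contract a pluggable episodic store would satisfy."""
    def write(self, session_id: str, entry: str) -> None: ...
    def search(self, query: str, k: int = 5) -> list[str]: ...

class FlatFileMemory:
    """Stand-in for the built-in MEMORY.md store: a flat, capacity-bound file."""
    def __init__(self, path: str = "MEMORY.md") -> None:
        self.path = path

    def write(self, session_id: str, entry: str) -> None:
        with open(self.path, "a") as f:
            f.write(f"[{session_id}] {entry}\n")

    def search(self, query: str, k: int = 5) -> list[str]:
        with open(self.path) as f:
            return [ln.strip() for ln in f if query.lower() in ln.lower()][:k]
```

Swapping in Mem0 or Honcho then means supplying a different implementation of the same contract instead of the flat file.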
The user modeling layer uses Honcho’s dialectic profiling to build a persistent model of who you are: your preferences, communication style, domain expertise. This runs across sessions and informs how the agent frames responses. It’s the difference between an agent that says “here’s how to deploy to Kubernetes” and one that says “since you prefer Helm charts over raw manifests, here’s the Helm approach.”
Where does self-improvement break down?
This is the section most Hermes coverage skips. Three failure modes matter in practice.
Failure 1: weak models create weak skills. Hermes is model-agnostic: it runs on DeepSeek, MiniMax, OpenAI, or local models via Ollama and vLLM. A DeepSeek V4 instance at $2/month will create skills, but those skills reflect DeepSeek’s reasoning quality. If the model makes a suboptimal choice during task execution, the skill codifies that suboptimal choice. The learning loop amplifies whatever the base model does, good or bad. Hermes 2 Pro achieves 90% function calling accuracy (per Nous Research benchmarks); general-purpose models of similar size hit 60-70%. The gap between a well-chosen and poorly-chosen base model compounds over time.
Failure 2: skill proliferation without curation. An agent that creates skills aggressively accumulates hundreds of them. Loading 200 skills into context is slower and noisier than loading 40. Without manual curation (reviewing, pruning, merging), the skill library becomes a junk drawer. v0.6.0 introduced multi-agent profiles to segment skills by context, but the curation burden still falls on the operator.
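A curation pass doesn’t have to be bespoke. Reusing the hypothetical Skill record from earlier, a periodic prune along these lines keeps the library bounded; the thresholds are mine, not framework defaults:

```python
def prune_skills(skills: list[Skill], floor: float = 0.3, cap: int = 40) -> list[Skill]:
    """Drop skills that keep underperforming, then keep only the strongest N."""
    survivors = [s for s in skills if s.confidence >= floor]
    survivors.sort(key=lambda s: s.confidence, reverse=True)
    return survivors[:cap]
```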
Failure 3: memory compression destroys context. The built-in 2,200-character memory forces aggressive summarization. A conversation about debugging a specific PostgreSQL replication issue becomes “user works with PostgreSQL.” The fix exists (pluggable backends in v0.7.0), but it requires infrastructure setup that most users skimming the README won’t do.
How does Hermes compare to other agent frameworks?
The comparison most people reach for, Hermes vs. ChatGPT, is wrong. ChatGPT is a conversational interface. Hermes is a persistent process. The right comparisons:
| Feature | Hermes Agent | OpenClaw | LangChain/CrewAI | Claude Code |
|---|---|---|---|---|
| Persistent memory | Yes (pluggable) | Yes (managed) | No (manual) | Session-only |
| Skill creation | Autonomous | Plugin marketplace | No | No |
| Self-hosted | Yes (MIT) | Yes + Cloud ($59/mo) | Yes | No |
| Messaging platforms | 14+ | 8 | None built-in | IDE only |
| MCP integration | Yes | Yes | Partial | Yes |
| Multi-agent | Subagents (no shared state) | Full orchestration | Yes (CrewAI) | Subagents |
| Minimum cost | $5/mo VPS + $2/mo API | $6/mo self-hosted | Free (no infra) | Subscription |
| GitHub stars | 26,300 | ~12,000 | 98,000+ (LangChain) | N/A |
OpenClaw is the closest competitor. The New Stack’s comparison (March 2026) frames it as learning-depth (Hermes) vs. ecosystem-breadth (OpenClaw). OpenClaw Cloud at $59/month eliminates infrastructure management. Hermes at $5 VPS + $2-50/month API costs is cheaper for light users but requires more operational overhead.
LangChain and CrewAI are orchestration frameworks, not persistent runtimes. They manage agent workflows but don’t create skills or maintain cross-session memory. If you need a persistent agent that grows over time, these aren’t direct alternatives.
The A2A vs MCP protocol comparison is relevant here: Hermes adopted MCP for tool extensibility, meaning any MCP-compatible service (GitHub, databases, custom APIs) plugs in directly.
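For a sense of what that extensibility looks like in practice, here’s a minimal sketch using the reference MCP Python SDK, with the GitHub server as an arbitrary example. This illustrates the protocol itself, not Hermes’s internal wiring:

```python
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def list_mcp_tools() -> None:
    # Launch an MCP server as a subprocess and list the tools it exposes.
    server = StdioServerParameters(
        command="npx", args=["-y", "@modelcontextprotocol/server-github"]
    )
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            for tool in tools.tools:
                print(tool.name, "-", tool.description)

asyncio.run(list_mcp_tools())
```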
When should you actually use Hermes?
Hermes fits a specific profile. Use it when:
- Tasks repeat with variation. Deploying to staging, processing invoices, triaging support tickets: workflows where the structure is stable but the details change. The skill loop pays off here.
- You need persistent context. If your agent needs to remember that the user prefers Python over TypeScript, or that the staging server uses port 3001, session-based tools waste time re-establishing context.
- You want model flexibility. Running DeepSeek locally for cost, switching to Claude for complex reasoning, using MiniMax for specific tasks. Hermes supports all of these through OpenRouter (200+ models) or direct API connections (see the sketch after this list).
- You control infrastructure. Hermes requires a VPS, API key management, and memory backend configuration. If you want managed, OpenClaw Cloud is simpler.
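On the model-flexibility point above: OpenRouter exposes an OpenAI-compatible API, so switching the backing model is effectively a one-line change. A minimal sketch, with illustrative model IDs:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="sk-or-...",  # your OpenRouter key
)

# Swap models per task without touching the rest of the agent,
# e.g. a cheap default for routine work, a frontier model for hard reasoning.
MODEL = "deepseek/deepseek-chat"

response = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": "Summarize today's staging deploy steps."}],
)
print(response.choices[0].message.content)
```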
Skip Hermes when tasks are one-off and diverse (no skill reuse), when you need true multi-agent coordination (subagents can’t share state — tracked in Issues #344/#4529), or when your team doesn’t have the operational capacity to manage a self-hosted agent runtime.
Frequently asked questions
What is Hermes Agent?
Hermes Agent is an open-source (MIT) Python-based agent runtime built by Nous Research, released in February 2026. It creates reusable skills from repeated tasks, maintains persistent memory across sessions, and connects to 14+ messaging platforms. Unlike stateless chatbots, Hermes remembers context and improves its workflows over time through autonomous skill codification. The project has 26,300 GitHub stars and over 240 contributors as of April 2026.
How does Hermes Agent’s skill creation work?
After completing a task, Hermes evaluates whether the workflow is worth preserving. If the pattern recurs (roughly every 15 similar tasks, per Geeky Gadgets reporting), it codifies the steps into a named skill stored as a structured file. On future encounters, the agent loads the relevant skill instead of reasoning from scratch. Skills can be edited, versioned, and shared across agent instances via the skill_manage tool.
How much does it cost to run Hermes Agent?
The software is free (MIT license). Infrastructure costs $5-80/month for a VPS (DigitalOcean, Hetzner, Vultr). LLM API costs range from $2/month with DeepSeek for light usage to $400+/month with frontier models for heavy coding tasks. OpenClaw Cloud, the main competitor, charges $59/month all-in. For light-to-moderate usage, self-hosted Hermes is cheaper.
What are Hermes Agent’s main limitations?
Three to watch: (1) the built-in memory system caps at roughly 2,200 characters, forcing lossy compression after about ten sessions (v0.7.0 adds pluggable backends to mitigate this); (2) skill quality depends on the base model’s reasoning, so weaker models create weaker, less generalizable skills; (3) subagents cannot share state or coordinate directly (GitHub Issues #344/#4529), limiting true multi-agent orchestration.
Want to work together?
I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.
Get in touch