Hermes incident-commander skill — autonomous SRE detection, diagnosis, and self-healing

⚕ Hermes Incident Commander

An autonomous SRE agent that detects, diagnoses, and heals production infrastructure — then learns from every incident it resolves.

Built on Hermes Agent by NousResearch. Submitted for the "Show us what Hermes Agent can do" challenge.

The Problem

When a production server goes down at 3 AM, an on-call engineer has to:

1. Wake up, check alerts
2. SSH in, run diagnostics manually
3. Piece together root cause from logs
4. Apply a fix, hopefully the right one
5. Verify it worked
6. Write a post-mortem nobody will read

Mean time to resolve (MTTR) for P0 incidents averages 45–60 minutes. Much of that time goes to manual steps that a sufficiently capable agent can run faster and more consistently.

Hermes Incident Commander handles all of these steps autonomously, in minutes, and gets smarter with each incident it handles.

Demo

# Install dependencies
pip install anthropic rich

# Set your API key
export ANTHROPIC_API_KEY=sk-ant-...

# Run a demo incident (disk full scenario)
python demo/demo_incident.py --scenario disk-full-logs

# Try other scenarios
python demo/demo_incident.py --scenario svc-crash-nginx
python demo/demo_incident.py --scenario cpu-runaway-process

What you'll see:

1. Hermes detects the incident and classifies severity (P0/P1/P2/P3)
2. Runs parallel diagnostics across CPU, memory, disk, and services
3. Identifies root cause with explicit reasoning
4. Applies the safest effective fix
5. Verifies the fix worked
6. Writes a structured post-incident report to ~/.hermes/incidents/
7. Creates a new prevention skill in ~/.hermes/skills/ so it handles this faster next time
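
The severity classification step can be illustrated with a small rule-based sketch. This is not the logic the skill actually uses: the `Vitals` type, the `CRITICAL_SERVICES` set, and the 90% thresholds are all assumptions, chosen only so the result lines up with the severities listed for the training scenarios.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Vitals:
    """Snapshot of system vitals (hypothetical shape, for illustration only)."""
    cpu_percent: float
    memory_percent: float
    disk_percent: float
    failed_services: list[str] = field(default_factory=list)


# Assumption: services whose failure makes the site unreachable.
CRITICAL_SERVICES = {"nginx"}


def classify_severity(v: Vitals) -> str:
    """Map vitals to P0..P3 using assumed thresholds, most severe rule first."""
    if any(s in CRITICAL_SERVICES for s in v.failed_services):
        return "P0"  # user-facing outage
    if v.disk_percent >= 90 or v.memory_percent >= 90:
        return "P1"  # resource exhaustion is imminent
    if v.cpu_percent >= 90 or v.failed_services:
        return "P2"  # degraded, but still serving
    return "P3"      # nothing urgent
```

Ordering the rules from most to least severe means the first match wins, so a host with both a crashed nginx and a full disk is reported as P0.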

How It Uses Every Hermes Feature

This project was designed to push every capability of Hermes Agent:

| Hermes Feature | How It's Used |
| --- | --- |
| Persistent Memory | Builds a system topology map over time. Learns which services fail together, time-of-day patterns, and which remediations work on YOUR infrastructure. |
| Skill Auto-Creation | After every novel incident, writes a new `SKILL.md` prevention playbook. Hermes gets measurably better at your stack over weeks. |
| Cron Scheduler | Every 5 min: critical health check. Every hour: full audit. Daily 08:00: morning briefing to Telegram. |
| Gateway (Telegram/Discord) | Real-time P0 alerts, resolution notices, and daily briefings delivered to your phone. |
| Subagent Spawning | For multi-service environments, spawns parallel subagents to investigate nginx, database, and application layers simultaneously. |
| Session Search (FTS5) | "Have we seen this error before?" Searches past incidents for matching patterns. |
| `execute_code` | Collapses multi-step diagnostic pipelines into single inference turns, dramatically reducing latency. |
| MCP Integration | Connects to cloud provider APIs (AWS/GCP/Azure MCP servers) for auto-scaling and cloud-native remediation. |
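
To make the "have we seen this error before?" lookup concrete, here is a minimal sketch of full-text search over past incidents using SQLite's FTS5 extension from Python. It does not show how Hermes stores or queries its sessions; the table layout, the `build_index` and `search` helpers, and the sample incident IDs are hypothetical, and it assumes the local SQLite build has FTS5 compiled in.

```python
import sqlite3


def build_index(incidents):
    """Create an in-memory FTS5 index from (incident_id, summary) pairs."""
    conn = sqlite3.connect(":memory:")
    # incident_id is stored but not tokenized; only summary is searchable.
    conn.execute(
        "CREATE VIRTUAL TABLE incidents USING fts5(incident_id UNINDEXED, summary)"
    )
    conn.executemany("INSERT INTO incidents VALUES (?, ?)", incidents)
    return conn


def search(conn, query, limit=5):
    """Return the best-matching incidents for an FTS5 query string."""
    rows = conn.execute(
        "SELECT incident_id, summary FROM incidents "
        "WHERE incidents MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return rows.fetchall()


# Hypothetical past incidents, for illustration only.
PAST_INCIDENTS = [
    ("inc-001", "nginx crashed after config reload, restarted service"),
    ("inc-002", "disk full on var log, rotated and compressed logs"),
]
```

Calling `search(build_index(PAST_INCIDENTS), "nginx")` returns only the first incident. Queries that contain punctuation such as hyphens need to be quoted as FTS5 phrases.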

Architecture

flowchart TD
    ALERT([🚨 Incident Alert]) --> DETECT

    DETECT["🔍 DETECT\nGather system vitals\nCPU · Memory · Disk · Services"]
    TRIAGE["⚖️ TRIAGE\nClassify severity\nP0 · P1 · P2 · P3"]
    DIAGNOSE["🔬 DIAGNOSE\nRoot cause analysis\nLogs · Processes · Stack traces"]
    REMEDIATE["🔧 REMEDIATE\nApply safest fix\nTier 1 → 2 → 3"]
    VERIFY["✅ VERIFY\nConfirm resolution\nBefore vs after metrics"]

    DETECT --> TRIAGE --> DIAGNOSE --> REMEDIATE --> VERIFY

    CRON["⏱️ CRON\nEvery 5 min: health check\nEvery hour: full audit\nDaily 08:00: briefing"]
    CRON -->|triggers| DETECT

    LEARN["🧠 LEARN\nWrite post-incident report\nCreate prevention SKILL.md\nUpdate MEMORY.md\nSearch past incidents (FTS5)"]
    VERIFY --> LEARN

    GATEWAY["📲 GATEWAY\nTelegram · Discord · Slack"]
    TRIAGE -->|"🚨 P0/P1 alert"| GATEWAY
    VERIFY -->|"✅ resolved"| GATEWAY
    CRON -->|"📋 daily briefing"| GATEWAY

    style DETECT fill:#1e3a5f,color:#fff
    style TRIAGE fill:#7b2d00,color:#fff
    style DIAGNOSE fill:#1e3a5f,color:#fff
    style REMEDIATE fill:#1a4731,color:#fff
    style VERIFY fill:#1a4731,color:#fff
    style LEARN fill:#3d2068,color:#fff
    style CRON fill:#2d2d2d,color:#fff
    style GATEWAY fill:#2d2d2d,color:#fff
    style ALERT fill:#7b2d00,color:#fff
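
The REMEDIATE and VERIFY stages in the diagram can be sketched as a loop that escalates through remediation tiers and stops at the first one that restores health. The diagram only says "Tier 1 → 2 → 3", so the meaning of the tiers here (a lower number is a less invasive fix) is an assumption, and the `remediate` helper and its signature are hypothetical.

```python
from typing import Callable, Iterable, Optional, Tuple

# (tier, name, action): a lower tier is assumed to be a less invasive fix.
Remediation = Tuple[int, str, Callable[[], None]]


def remediate(
    tiers: Iterable[Remediation],
    is_healthy: Callable[[], bool],
) -> Optional[str]:
    """Apply fixes in ascending tier order, verifying after each one.

    Returns the name of the first fix that restores health, or None if
    every tier was tried without success (a case for human escalation).
    """
    for _tier, name, action in sorted(tiers, key=lambda t: t[0]):
        action()
        if is_healthy():  # VERIFY before escalating any further
            return name
    return None
```

Verifying after every action keeps the agent from applying a more disruptive fix once a gentler one has already worked.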

Project Structure

graph LR
    ROOT["📁 hermes-incident-commander"]

    ROOT --> SKILLS["📁 skills/"]
    ROOT --> ENVS["📁 environments/"]
    ROOT --> DEMO["📁 demo/"]
    ROOT --> TESTS["📁 tests/"]
    ROOT --> DOCS["📁 docs/"]
    ROOT --> REQ["📄 requirements.txt"]

    SKILLS --> SKILL_MD["📄 incident-commander/SKILL.md\n← install into ~/.hermes/skills/"]

    ENVS --> ENV_PY["🐍 incident_env.py\n← Atropos RL environment"]
    ENVS --> ENV_CFG["⚙️ incident_config.yaml\n← training configuration"]

    DEMO --> DEMO_PY["🐍 demo_incident.py\n← standalone demo"]

    TESTS --> TEST_PY["🐍 test_incident_env.py\n← pytest test suite"]

    DOCS --> SETUP["📄 SETUP.md"]
    DOCS --> WRITEUP["📄 WRITEUP.md"]

    style ROOT fill:#1e3a5f,color:#fff
    style SKILL_MD fill:#1a4731,color:#fff
    style ENV_PY fill:#3d2068,color:#fff
    style DEMO_PY fill:#7b2d00,color:#fff
    style TEST_PY fill:#2d2d2d,color:#fff

Installation (Full Hermes Setup)

1. Install Hermes Agent

curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash

2. Configure Hermes

hermes setup        # Interactive setup wizard
hermes model        # Choose your model (Nous Portal recommended)
hermes gateway setup  # Connect Telegram/Discord for alerts

3. Install the Incident Commander Skill

# Copy the skill to Hermes's skills directory
cp -r skills/incident-commander ~/.hermes/skills/

# Verify it's loaded
hermes
> /skills

4. Set Up Monitoring Cron Jobs

In your Hermes conversation:

Set up incident monitoring: run a health check every 5 minutes and alert me
on Telegram if anything is P0 or P1. Send me a daily briefing at 08:00.

Hermes will install the cron jobs automatically.
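
For orientation, the schedule described above corresponds to the standard five-field cron expressions below, paired with a minimal disk check of the kind a 5-minute health check might run. How Hermes actually stores its schedules and what its health check inspects are not documented here, so the `SCHEDULE` mapping, the `health_check` function, and the 90% threshold are assumptions.

```python
import shutil

# Standard cron expressions for the cadence described above (job names are hypothetical).
SCHEDULE = {
    "health_check": "*/5 * * * *",   # every 5 minutes
    "full_audit": "0 * * * *",       # at the top of every hour
    "daily_briefing": "0 8 * * *",   # every day at 08:00
}


def disk_percent(path: str = "/") -> float:
    """Return the percentage of disk space used on the filesystem at `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def health_check(path: str = "/", disk_threshold: float = 90.0) -> dict:
    """Report disk usage and whether it crosses the assumed alert threshold."""
    pct = disk_percent(path)
    return {"disk_percent": round(pct, 1), "alert": pct >= disk_threshold}
```

A real check would also cover CPU, memory, and service state; disk usage is shown because it needs only the standard library.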

5. Run the RL Training Environment (Optional)

# Install Atropos
pip install atroposlib

# Generate SFT training data
python environments/incident_env.py process --config environments/incident_config.yaml

# Full RL training (requires vLLM)
python environments/incident_env.py serve --config environments/incident_config.yaml

Reward Function (for RL Training)

The training environment uses a multi-component reward that captures real SRE quality:

pie title Reward Components
    "Resolution — Did the incident get fixed?" : 50
    "RCA Quality — Root cause explained?" : 15
    "Report Quality — Post-mortem written?" : 15
    "Skill Created — Prevention skill added?" : 10
    "Response Speed — Fast MTTR?" : 5
    "Tool Efficiency — Minimal tool calls?" : 5

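The chart's percentages can be read as weights in a weighted sum. The sketch below shows that arithmetic only. The weights come from the chart, but the component key names, the 0-to-1 scoring range, and the clamping are assumptions rather than the behavior of `incident_env.py`.

```python
# Weights taken from the reward chart (they sum to 1.0); key names are hypothetical.
REWARD_WEIGHTS = {
    "resolution": 0.50,
    "rca_quality": 0.15,
    "report_quality": 0.15,
    "skill_created": 0.10,
    "response_speed": 0.05,
    "tool_efficiency": 0.05,
}


def total_reward(scores: dict) -> float:
    """Combine per-component scores (assumed to lie in [0, 1]) into one reward."""
    missing = REWARD_WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing reward components: {sorted(missing)}")
    # Clamp each score so a single component cannot exceed its weight.
    return sum(
        weight * min(max(scores[name], 0.0), 1.0)
        for name, weight in REWARD_WEIGHTS.items()
    )
```

With these weights, an episode that fixes the incident but produces no report, skill, or root-cause explanation earns at most 0.6, so resolution dominates without being the whole score.
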
Incident Scenarios (Training Scenarios)

| ID | Severity | Category | Description |
| --- | --- | --- | --- |
| `svc-crash-nginx` | P0 | service | nginx crashed, website unreachable |
| `disk-full-logs` | P1 | disk | 95% disk usage from exploded log files |
| `memory-leak-process` | P1 | memory | Mystery process eating 150MB+ |
| `cpu-runaway-process` | P2 | cpu | 95% CPU from runaway computation |
| `failed-systemd-unit` | P2 | service | Custom worker service in failed state |
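
One way to represent the table above in code is a small immutable record per scenario. This is a sketch, not the definitions in `environments/incident_env.py`: the `Scenario` class and the `by_severity` helper are hypothetical, and only the field values come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """One training scenario; fields mirror the table's columns."""
    id: str
    severity: str
    category: str
    description: str


SCENARIOS = [
    Scenario("svc-crash-nginx", "P0", "service", "nginx crashed, website unreachable"),
    Scenario("disk-full-logs", "P1", "disk", "95% disk usage from exploded log files"),
    Scenario("memory-leak-process", "P1", "memory", "Mystery process eating 150MB+"),
    Scenario("cpu-runaway-process", "P2", "cpu", "95% CPU from runaway computation"),
    Scenario("failed-systemd-unit", "P2", "service", "Custom worker service in failed state"),
]


def by_severity(severity: str) -> list:
    """Return all scenarios with the given severity, in table order."""
    return [s for s in SCENARIOS if s.severity == severity]
```

For example, `by_severity("P1")` yields the disk and memory scenarios, which is convenient for sampling training episodes by severity.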

Running Tests

# Install test dependencies
pip install pytest pytest-asyncio

# Run full test suite
pytest tests/ -v

# Run specific test classes
pytest tests/test_incident_env.py::TestScenarioDefinitions -v
pytest tests/test_incident_env.py::TestRewardFunction -v
pytest tests/test_incident_env.py::TestSkillFile -v

Why This Wins

Real problem, real impact. P0 incidents can cost companies thousands of dollars per minute. Shaving 30 minutes off MTTR with an autonomous agent is immediately valuable.
Uses every Hermes capability. Memory, skills, cron, gateway, subagents, session search, execute_code, and MCP are all integrated into a coherent, meaningful workflow.
Self-improving. The longer Hermes runs, the better it gets at your specific infrastructure. This is Hermes's core promise - "the agent that grows with you" - demonstrated concretely.
Closes the training loop. The Atropos RL environment means this isn't just a demo - it's a path to training models that are genuinely better at agentic SRE tasks.
Ships with working code. The demo runs standalone, the tests pass, and the skill file installs in one command.

License

MIT

Built with Hermes Agent - the agent that grows with you.
