---
name: incident-commander
description: >
  Autonomous incident detection, root-cause analysis, and self-healing for
  Linux/Docker production environments. Activate when the user mentions:
  server down, high CPU, memory leak, disk full, service crash, deployment
  failure, alert firing, on-call page, or any infrastructure emergency. Also
  activates on scheduled health checks ("run a health check", "monitor my
  server"). Do NOT activate for general coding questions or
  non-infrastructure topics.
license: MIT
metadata:
  author: hermes-incident-commander
  version: "1.0"
  tags:
    - devops
    - sre
    - monitoring
    - incident-response
    - self-healing
---

# Incident Commander Skill

You are an autonomous Site Reliability Engineer. When an incident is detected or reported, you follow the loop below without waiting for further human input unless a destructive action requires approval.

## Core Incident Loop

DETECT → TRIAGE → DIAGNOSE → REMEDIATE → VERIFY → DOCUMENT → LEARN

### 1. DETECT

Gather signals immediately. Run all diagnostics in parallel via subagents when possible:

```bash
# System vitals (always run first)
top -bn1 | head -20
free -h
df -h
uptime
systemctl list-units --failed
journalctl -p err -n 50 --no-pager
```
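
When subagents are unavailable, the same fan-out works in a plain shell. A minimal sketch, assuming a snapshot directory convention (`~/.hermes/incidents/snapshots` is illustrative, not part of the skill's contract):

```bash
#!/usr/bin/env bash
# Run the vitals checks concurrently and keep the output as a timestamped
# snapshot; VERIFY can diff against it later. Paths are illustrative.
snap_dir=~/.hermes/incidents/snapshots/$(date +%Y%m%dT%H%M%S)
mkdir -p "$snap_dir"

uptime                             > "$snap_dir/uptime.txt" &
free -h                            > "$snap_dir/memory.txt" &
df -h                              > "$snap_dir/disk.txt"   &
systemctl list-units --failed      > "$snap_dir/units.txt"  &
journalctl -p err -n 50 --no-pager > "$snap_dir/errors.txt" &
wait
echo "Vitals snapshot: $snap_dir"
```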

### 2. TRIAGE — Severity Classification

| Severity | Criteria                          | Response SLA |
|----------|-----------------------------------|--------------|
| P0       | Total outage, data loss risk      | Immediate    |
| P1       | Partial outage, degraded service  | < 5 min      |
| P2       | Performance degraded, no outage   | < 30 min     |
| P3       | Warning thresholds, no impact     | < 2 hours    |

Announce severity via gateway immediately after triage.
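
Severity can be pre-scored mechanically from the DETECT vitals before the model refines it. A rough sketch; the thresholds and the failed-unit heuristic are assumptions to tune per host:

```bash
#!/usr/bin/env bash
# Illustrative severity heuristic over two cheap signals; not a fixed policy.
failed=$(systemctl list-units --failed --no-legend | wc -l)
disk=$(df / --output=pcent | tail -1 | tr -dc '0-9')

if (( failed > 0 && disk >= 95 )); then
  sev=P0   # services down and disk near exhaustion: treat as total-outage risk
elif (( failed > 0 )); then
  sev=P1   # something is down
elif (( disk >= 90 )); then
  sev=P2   # degraded, no outage yet
else
  sev=P3   # warning territory at most
fi
echo "Triage severity: $sev"
```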

### 3. DIAGNOSE — Root Cause Analysis

**High CPU:**

```bash
ps aux --sort=-%cpu | head -20
timeout 10 strace -c -e trace=all -p <pid> 2>&1   # 10s syscall profile; -c prints its summary on detach
lsof -p <pid> | wc -l
```

**Memory pressure:**

```bash
cat /proc/meminfo
ps aux --sort=-%mem | head -20
grep -E "VmRSS|VmPeak" /proc/<pid>/status
cat /proc/<pid>/oom_score   # OOM score lives in its own file, not in status
```

**Disk full:**

```bash
du -sh /* 2>/dev/null | sort -rh | head -20
find / -name "*.log" -size +100M 2>/dev/null
lsof | grep deleted | awk '{print $7, $9}' | sort -rn | head -10
```

**Service crash:**

```bash
systemctl status <service> -l --no-pager
journalctl -u <service> -n 100 --no-pager
```

**Docker container issues:**

```bash
docker ps -a
docker stats --no-stream
docker logs <container> --tail 100
```

### 4. REMEDIATE — Self-Healing Actions

Execute fixes in order of safety (least-destructive first); a dispatcher sketch follows the tier lists below.

**Tier 1 — Safe (no approval needed):**

- Clear temp files and old logs
- Restart failed non-critical services
- Adjust kernel parameters (sysctl)
- Kill runaway processes (non-PID-1)

**Tier 2 — Moderate (warn user, proceed after 30s unless cancelled):**

- Restart critical services
- Roll back the last deployment
- Scale resources (if a cloud API is available)

**Tier 3 — Destructive (explicit approval required):**

- Data deletion
- Node termination
- Database operations
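
A minimal sketch of how the tiers might gate execution; the `remediate` helper and its approval flow are illustrative assumptions, not a fixed API:

```bash
#!/usr/bin/env bash
# Hypothetical tier gate: tier 1 runs at once, tier 2 gives a 30s cancel
# window, tier 3 blocks on explicit approval.
remediate() {
  local tier=$1; shift
  case "$tier" in
    1) "$@" ;;                                           # safe: run immediately
    2) echo "Tier 2 action in 30s: $* (Ctrl-C to cancel)"
       sleep 30 && "$@" ;;
    3) read -rp "Tier 3 (destructive): $* -- approve? [y/N] " ok
       [[ $ok == y ]] && "$@" ;;
  esac
}

remediate 1 systemctl restart some-noncritical.service   # example tier 1 call
```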

### 5. VERIFY — Confirm Resolution

Run the same diagnostics as step 1. Compare before/after metrics. Declare resolution only when:

- All previously failed checks now pass
- Error rate returns to baseline
- Service response time is normal
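
One concrete before/after check, reusing the snapshot directory assumed in the DETECT sketch:

```bash
# Diff pre-incident vitals against the current state; a clean VERIFY shows
# failed units and disk pressure back at baseline. Paths are illustrative.
base=~/.hermes/incidents/snapshots
before=$base/$(ls "$base" | head -1)     # earliest snapshot of this incident
df -h > /tmp/disk.now
diff -u "$before/disk.txt" /tmp/disk.now || echo "disk usage changed; review delta"
[ -z "$(systemctl list-units --failed --no-legend)" ] && echo "no failed units"
```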

### 6. DOCUMENT — Post-Incident Report

Always write a structured report to `~/.hermes/incidents/<timestamp>-<slug>.md`:

```markdown
# Incident Report: <title>
**Date:** <ISO datetime>
**Severity:** P<n>
**Duration:** <X> minutes
**Impact:** <description>

## Timeline
- HH:MM — Detection
- HH:MM — Triage complete
- HH:MM — Root cause identified
- HH:MM — Remediation applied
- HH:MM — Resolution confirmed

## Root Cause
<clear technical explanation>

## Remediation Steps
1. <step>
2. <step>

## Prevention
<what to change to prevent recurrence>

## Metrics
- MTTD (Mean Time to Detect): X min
- MTTR (Mean Time to Resolve): X min
```

### 7. LEARN — Skill Auto-Creation

After every resolved incident, analyze the root cause and create a new prevention skill if the pattern is novel:

```python
from pathlib import Path

# Template written to ~/.hermes/skills/<incident-type>-prevention/SKILL.md
skill_template = """\
---
name: {incident_type}-prevention
description: >
  Detect and prevent {incident_type} incidents. Activate when monitoring
  detects {trigger_conditions}.
---
# {incident_type} Prevention

## Early Warning Signs
{warning_signs}

## Automated Checks
{checks}

## Remediation Playbook
{playbook}
"""

def create_prevention_skill(incident_type: str, **fields: str) -> Path:
    """Render the template and write the new skill to disk."""
    skill_dir = Path.home() / ".hermes" / "skills" / f"{incident_type}-prevention"
    skill_dir.mkdir(parents=True, exist_ok=True)
    path = skill_dir / "SKILL.md"
    path.write_text(skill_template.format(incident_type=incident_type, **fields))
    return path
```

## Cron Health Checks

When asked to set up monitoring, install these cron jobs:

```cron
# Every 5 minutes — critical metrics
*/5 * * * * Run incident health check, alert on P0/P1 via Telegram

# Every hour — comprehensive audit
0 * * * * Run full system audit, save report to ~/.hermes/incidents/

# Daily at 08:00 — morning briefing on the last 24h
0 8 * * * Analyze last 24h incidents, send morning briefing to Telegram
```
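
If a plain crontab is preferred over agent-interpreted tasks, a small script along these lines could back the 5-minute job; the thresholds and the exit-code contract are assumptions:

```bash
#!/usr/bin/env bash
# health-check.sh: print a reason and exit non-zero when a P0/P1 signal fires,
# so the caller (cron wrapper, Hermes, etc.) knows to alert. Illustrative only.
status=0

if (( $(systemctl list-units --failed --no-legend | wc -l) > 0 )); then
  echo "ALERT: failed systemd units"; status=1
fi

disk=$(df / --output=pcent | tail -1 | tr -dc '0-9')
if (( disk >= 90 )); then
  echo "ALERT: root filesystem at ${disk}%"; status=1
fi

mem=$(free | awk '/^Mem:/ {printf "%d", $3/$2*100}')
if (( mem >= 95 )); then
  echo "ALERT: memory at ${mem}%"; status=1
fi

exit $status
```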

## Subagent Parallelism

For multi-service environments, spawn parallel subagents:

```python
# Example: investigate 3 services simultaneously
subagents = [
    delegate("Check nginx status and access logs"),
    delegate("Check database connection pool and slow queries"),
    delegate("Check application logs for exceptions"),
]
# Synthesize results and correlate findings
```

## Memory Usage

After each incident, update MEMORY.md with:

- Which services tend to fail together (correlation map)
- Time-of-day patterns (e.g., "high CPU every weekday 9-10am")
- Which remediations worked vs. didn't
- Infrastructure topology learned over time

This builds a system-specific knowledge base that improves response quality over time — Hermes gets smarter about YOUR infrastructure specifically.

## Notification Templates

**P0 Alert (Telegram):**

```text
🚨 P0 INCIDENT DECLARED
Service: <name>
Impact: <description>
Started: <time>
Hermes is investigating. Updates every 60s.
```

**Resolution Notice:**

```text
✅ INCIDENT RESOLVED
Duration: X minutes
Root cause: <summary>
MTTR: X min | Full report: ~/.hermes/incidents/<file>
```

## Integration Points

- **Hermes Memory** — incident history, infrastructure topology, known-bad patterns
- **Hermes Gateway** — real-time Telegram/Discord/Slack alerts
- **Hermes Cron** — scheduled health checks, daily briefings
- **Hermes Subagents** — parallel investigation of multiple services
- **Hermes Skills** — auto-creates new skills from incident learnings
- **Hermes Session Search** — "have we seen this error before?"