singularity-forge/docs/adr/0078-vault-credential-resolution.md
Mikael Hugo fce0c4c781 Tier 1.1: Implement vault credential resolver for provider keys
2026-05-07 04:59:07 +02:00


id: 0078
title: Vault Credential Resolution for Provider Keys
status: accepted
date: 2026-05-07

ADR-0078: Vault Credential Resolution for Provider Keys

Problem

SF v3.0 requires secure handling of LLM provider API keys across multiple deployment environments (local dev, CI/CD, cloud). Currently, API keys are stored as plaintext in:

  • Environment variables (.env, shell, CI secrets)
  • Auth storage files (auth.json)

This approach has security and operational risks:

  1. Secret sprawl: Keys duplicated across many environment configs
  2. Audit gap: No audit trail of which systems accessed which secrets
  3. Rotation friction: Manual key updates across multiple systems
  4. Principle of Least Privilege violation: All agents have access to all keys

Decision

Implement Vault credential resolution that:

  1. Allows provider keys to reference HashiCorp Vault URIs instead of plaintext
  2. Maintains backward compatibility with plaintext keys and auth.json
  3. Uses fail-open semantics: if Vault is unavailable, resolution falls back to the plaintext value (the unresolved vault:// URI string) instead of failing
  4. Supports async resolution at runtime (no blocking on startup)
  5. Keeps doctor checks synchronous (fast health check without HTTP calls)

URI Format

vault://secret/path/to/secret#fieldname

Examples:

ANTHROPIC_API_KEY=vault://secret/anthropic/prod#api_key
OPENAI_API_KEY=vault://secret/openai/prod#api_key
GROQ_API_KEY=vault://secret/groq/prod#api_key
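
To make the URI format concrete, the following is a minimal parsing sketch. It is not the actual parseVaultUri() from vault-resolver.js; the function name, the returned shape, and the choice to require a #fieldname are assumptions made for illustration. It treats the first path segment as the mount point, as in the path structure described later in this ADR.

```javascript
// Hypothetical sketch; the real parseVaultUri() in vault-resolver.js may differ.
function parseVaultUriSketch(uri) {
  const PREFIX = "vault://";
  if (typeof uri !== "string" || !uri.startsWith(PREFIX)) return null;

  const rest = uri.slice(PREFIX.length);
  const hash = rest.indexOf("#");
  if (hash === -1) return null; // assumption: a field name is required

  const fullPath = rest.slice(0, hash);
  const field = rest.slice(hash + 1);
  const segments = fullPath.split("/");

  // Need a mount plus at least one path segment, none of them empty.
  if (segments.length < 2 || segments.some((s) => s === "") || field === "") {
    return null;
  }

  const [mount, ...pathParts] = segments;
  return { mount, path: pathParts.join("/"), field };
}

console.log(parseVaultUriSketch("vault://secret/anthropic/prod#api_key"));
```

Returning null instead of throwing keeps the helper usable from synchronous checks that only need to know whether a value is a well-formed vault URI.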

Authentication Chain

In order of preference:

  1. VAULT_ADDR and VAULT_TOKEN environment variables
  2. ~/.vault-token file (standard Vault client behavior)
  3. AppRole (VAULT_ROLE_ID + VAULT_SECRET_ID) — reserved for future use
  4. Fail open: if no auth method is available, return the vault:// URI string unresolved
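
The lookup order above can be sketched as follows. This is an illustration, not the actual resolveVaultToken(): the function name, the injected `env` and `readTokenFile` parameters, and the returned shape are assumptions chosen so the logic carries no hidden I/O.

```javascript
// Hypothetical sketch of the token lookup order; not the real resolveVaultToken().
// `env` and `readTokenFile` are injected so the function is easy to test.
function resolveVaultTokenSketch(env, readTokenFile) {
  // 1. Explicit token from the environment.
  const fromEnv = (env.VAULT_TOKEN || "").trim();
  if (fromEnv) return { token: fromEnv, source: "env" };

  // 2. Token file (~/.vault-token), as used by the standard Vault client.
  try {
    const fromFile = (readTokenFile() || "").trim();
    if (fromFile) return { token: fromFile, source: "file" };
  } catch {
    // Missing or unreadable file: fall through to the next step.
  }

  // 3. AppRole is reserved for future use, so it is skipped here.
  // 4. No auth method available: the caller is expected to fail open.
  return null;
}
```

In real use, `readTokenFile` could be something like `() => fs.readFileSync(path.join(os.homedir(), ".vault-token"), "utf8")`.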

Resolution Chain for Provider Keys

When SF or pi-ai needs a provider credential:

  1. Check the environment variable (e.g., ANTHROPIC_API_KEY)
  2. If the value starts with vault://, call the async resolver to fetch the secret from Vault
  3. If Vault is unavailable, use the URI string as the plaintext value (fail-open)
  4. If the environment variable is not set, check auth.json
  5. Return undefined if no credential is found
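
A minimal sketch of this chain, under stated assumptions: `fetchFromVault` stands in for the low-level Vault client, and `authStore` is a plain object standing in for auth.json (keyed by env var name purely for illustration). Neither reflects the real module APIs.

```javascript
// Hypothetical sketch of the lookup order. `fetchFromVault` and `authStore`
// are stand-ins, not the real vault-resolver.js or auth.json interfaces.
async function resolveProviderKeySketch(envVarName, { env, fetchFromVault, authStore }) {
  const value = env[envVarName];

  if (value) {
    // Plaintext keys are used as-is (backward compatible).
    if (!value.startsWith("vault://")) return value;
    try {
      return await fetchFromVault(value);
    } catch {
      // Fail-open: hand back the unresolved URI string.
      return value;
    }
  }

  // Env var unset or empty: fall back to the auth store, else undefined.
  return authStore[envVarName];
}
```

Note that under fail-open the caller may receive a vault:// string rather than a usable key, so downstream code should be prepared for the provider to reject it.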

Doctor Checks (Synchronous)

Health checks remain fast by:

  1. Checking that the env var exists and is non-empty (whether it holds a plaintext key or a vault:// URI)
  2. If the env var contains vault://, reporting "Vault" as the source without resolving it
  3. Deferring actual resolution until the credential is used
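
A synchronous check along these lines might look like the sketch below. The function name, the returned shape, and the "env" source label are assumptions; only the "vault" label and the no-network constraint come from this ADR.

```javascript
// Hypothetical sketch of a synchronous source check; performs no network I/O.
function describeKeySourceSketch(envVarName, env) {
  const value = env[envVarName];
  if (typeof value !== "string" || value.trim() === "") {
    return { found: false, source: null };
  }
  // A vault:// value is reported by source only; it is not resolved here.
  const source = value.startsWith("vault://") ? "vault" : "env";
  return { found: true, source };
}
```

Because the function only inspects a string prefix, it stays cheap enough to run for every provider on each doctor invocation.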

Implementation

New Modules

vault-credential-resolver.js — Provider credential resolution with vault support

  • couldBeVaultUri(value) — Check if value looks like vault URI (no network I/O)
  • hasProviderCredentialEnvVar(envVarName) — Check if env var exists (no network I/O)
  • resolveProviderCredential(envValue) — Resolve vault URI to actual key (async)
  • resolveProviderCredentials(map) — Resolve multiple credentials (async)
  • getCredentialValue(result, strictMode) — Extract/validate resolved value
  • formatCredentialInfo(result, providerId) — Format for doctor output (masks value)

vault-resolver.js (existing) — Low-level vault client

  • parseVaultUri(uri) — Parse vault:// URIs
  • resolveVaultToken() — Resolve auth token from env/file/AppRole
  • resolveSecret(uri, opts) — Fetch secret from Vault with fail-open
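
Since formatCredentialInfo() is described as masking the value for doctor output, a masking helper could be as simple as the sketch below. The name and the exact masking rule (keep only the last four characters) are assumptions, not the module's actual behavior.

```javascript
// Hypothetical masking helper; the real formatCredentialInfo() output may differ.
function maskSecretSketch(value) {
  if (typeof value !== "string" || value === "") return "(not set)";
  // Never echo short values, even partially.
  if (value.length <= 8) return "****";
  return "****" + value.slice(-4);
}
```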

Integration Points

  1. doctor-providers.js — Updated to detect vault URIs

    • resolveKey() now checks couldBeVaultUri() for vault:// URIs
    • Reports "vault" as source for vault URIs (no blocking)
  2. pi-ai getEnvApiKey() — No changes needed initially

    • Returns vault:// URI as-is (callers must resolve async if needed)
    • Future: add async variant getEnvApiKeyAsync() for direct vault support
  3. pi-coding-agent resolve-config-value.ts — Already supports vault URIs

    • resolveConfigValueAsync() handles vault:// URIs
    • Used when pi-ai actually makes API calls
  4. SF agent setup — Can initialize credential cache

    • Pre-resolve commonly-used credentials at startup
    • Cache with TTL (default 5 minutes, configurable)
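
The credential cache mentioned in the last integration point could be sketched as below. All names are hypothetical, and the injectable `now` clock exists only to make expiry testable; the 5-minute default matches the value stated above.

```javascript
// Hypothetical TTL cache sketch; names and structure are assumptions.
const DEFAULT_TTL_MS = 5 * 60 * 1000; // 5 minutes, the default stated above

function createCredentialCacheSketch(resolve, ttlMs = DEFAULT_TTL_MS, now = Date.now) {
  const entries = new Map(); // uri -> { value, expiresAt }

  return async function getCached(uri) {
    const hit = entries.get(uri);
    if (hit && hit.expiresAt > now()) return hit.value;

    // Miss or expired entry: resolve again and store with a fresh expiry.
    const value = await resolve(uri);
    entries.set(uri, { value, expiresAt: now() + ttlMs });
    return value;
  };
}
```

A short TTL bounds how long a rotated key stays in use, at the cost of more frequent calls to Vault.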

Rationale

Why Fail-Open?

  • Vault may not be available in all environments (local dev, offline use)
  • Graceful degradation allows fallback to plaintext keys without blocking
  • Operator can choose strict mode if needed
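
The difference between fail-open and strict mode can be illustrated with a small sketch. The `result` shape used here (`{ value, resolved }`) is an assumption and not the documented return type of resolveProviderCredential(); only the strictMode parameter is taken from the module description.

```javascript
// Hypothetical sketch; the `result` shape ({ value, resolved }) is an assumption.
function getCredentialValueSketch(result, strictMode = false) {
  const unresolvedVaultUri =
    !result.resolved &&
    typeof result.value === "string" &&
    result.value.startsWith("vault://");

  if (unresolvedVaultUri && strictMode) {
    throw new Error("Vault credential could not be resolved (strict mode)");
  }
  // Fail-open: return whatever is available, even an unresolved URI.
  return result.value;
}
```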

Why Async?

  • Network I/O to Vault happens at credential usage time, not startup
  • Startup remains fast (doctor checks are synchronous)
  • Credentials can be refreshed by re-resolving throughout session

Why Not Modify pi-ai getEnvApiKey?

  • getEnvApiKey is sync; vault resolution is async
  • Cleaner separation: pi-ai doesn't know about vault
  • SF or pi-coding-agent handles async resolution at the point of use
  • Allows gradual migration: new code uses async, old code still works with plaintext

Vault KV v2 API

Vault path structure:

secret/                       # Mount point
├── anthropic/               # Provider
│   ├── prod                 # Environment/secret name
│   │   └── api_key          # Field in secret
│   └── dev
└── openai/
    ├── prod
    │   ├── api_key
    │   └── org_id
    └── staging

URI to fetch api_key from secret/anthropic/prod:

vault://secret/anthropic/prod#api_key
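
For reference, a KV v2 read inserts `data/` between the mount and the secret path, authenticates with the `X-Vault-Token` header, and nests the secret's fields under `data.data` in the JSON response. The sketch below shows that mapping; the function names and parameters are hypothetical, and `vault.example.com` in the usage line is a placeholder.

```javascript
// Sketch of a KV v2 read. Function names and parameters are hypothetical;
// the /v1/<mount>/data/<path> route and X-Vault-Token header are standard Vault.
function buildKvV2RequestSketch(vaultAddr, token, { mount, path }) {
  const base = vaultAddr.replace(/\/+$/, ""); // drop trailing slashes
  return {
    url: `${base}/v1/${mount}/data/${path}`,
    headers: { "X-Vault-Token": token },
  };
}

// KV v2 nests the secret's fields under data.data in the JSON response.
function extractFieldSketch(responseJson, field) {
  return responseJson?.data?.data?.[field];
}

const req = buildKvV2RequestSketch("https://vault.example.com:8200/", "placeholder-token", {
  mount: "secret",
  path: "anthropic/prod",
});
console.log(req.url);
```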

Query Patterns (Future)

With vault URIs persisted in config, audit and operations teams could run queries such as:

-- Find all provider credentials using vault
SELECT provider_id, env_var_name, env_var_value FROM provider_config
WHERE env_var_value LIKE 'vault://%';

-- Reconstruct which services were using which vault secrets
SELECT config.provider_id, secrets.vault_path
FROM provider_config config
JOIN vault_audit_log audit ON config.env_var_value = audit.uri
JOIN vault_secrets secrets ON audit.secret_id = secrets.id;

Security Considerations

  1. Token Storage: VAULT_TOKEN and ~/.vault-token must be protected; the token file should be readable only by its owner
  2. Network: Use HTTPS for Vault connections (VAULT_ADDR should be https://)
  3. Audit: Enable Vault audit logging to track secret access
  4. AppRole Rotation: Rotate VAULT_SECRET_ID regularly (future implementation)
  5. Plaintext Fallback: Because resolution fails open, operators must be aware that Vault can be bypassed when it is unreachable or no auth method is available
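
The HTTPS recommendation could be enforced with a small synchronous check like the sketch below. The function name and the returned shape are assumptions; nothing in this ADR states that such a check exists today.

```javascript
// Hypothetical check that VAULT_ADDR is a valid https:// URL.
function checkVaultAddrSketch(vaultAddr) {
  let url;
  try {
    url = new URL(vaultAddr);
  } catch {
    return { ok: false, reason: "VAULT_ADDR is not a valid URL" };
  }
  if (url.protocol !== "https:") {
    return { ok: false, reason: "VAULT_ADDR should use https://" };
  }
  return { ok: true, reason: null };
}
```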

Backward Compatibility

  • Plaintext API keys continue to work unchanged
  • Existing auth.json credentials unaffected
  • No breaking changes to SF or pi-ai APIs
  • Doctor checks work exactly the same (just report vault as source when applicable)

Testing Strategy

  1. Unit tests — Vault resolver with mocked fetch

    • URI parsing (valid/invalid formats)
    • Auth chain (env and file; AppRole not yet implemented)
    • Caching TTL
    • Fail-open behavior
  2. Integration tests (manual, requires Vault instance)

    • End-to-end: set ANTHROPIC_API_KEY=vault://..., verify SF picks it up
    • Auth chain: test each auth method (VAULT_TOKEN, ~/.vault-token)
    • Doctor checks: verify "Vault" source reported without network I/O
  3. Regression tests

    • Plaintext keys still work
    • auth.json still used as fallback
    • No new test failures in existing suite

Future Work

  1. AppRole support — For CI/CD without token files
  2. Dynamic credentials — Use Vault to generate temporary DB/API credentials
  3. Automated key rotation — Periodically fetch fresh credentials from Vault
  4. Audit integration — Log which credentials were used (for compliance)
  5. Multi-environment — Support vault://secret/anthropic/prod#api_key vs vault://secret/anthropic/staging#api_key per phase

References