Mikael Hugo fce0c4c781 Tier 1.1: Implement vault credential resolver for provider keys

- Add vault-credential-resolver.js: Async credential resolution with vault:// URI support
- Integration with vault-resolver.js (low-level Vault client)
- Update doctor-providers.js to detect and report vault URIs
- Synchronous doctor checks (no network I/O) with lazy async resolution
- Fail-open semantics: vault unavailable -> fall back to plaintext
- 28 tests for credential resolver (all passing)
- ADR-0078: Architecture and auth chain documentation

Features:
- vault://secret/path/to/secret#fieldname URI format
- Auth chain: VAULT_TOKEN -> ~/.vault-token -> AppRole (reserved)
- Helper functions: couldBeVaultUri, hasProviderCredentialEnvVar, resolveProviderCredential, getCredentialValue, formatCredentialInfo
- Full backward compatibility with plaintext keys and auth.json

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

2026-05-07 04:59:07 +02:00

7.6 KiB

Raw Permalink Blame History

id	title	status	date
0078	Vault Credential Resolution for Provider Keys	accepted	2026-05-07

ADR-0078: Vault Credential Resolution for Provider Keys

Problem

SF v3.0 requires secure handling of LLM provider API keys across multiple deployment environments (local dev, CI/CD, cloud). Currently, API keys are stored as plaintext in:

Environment variables (.env, shell, CI secrets)
Auth storage files (auth.json)

This approach has security and operational risks:

Secret sprawl: Keys duplicated across many environment configs
Audit gap: No audit trail of which systems accessed which secrets
Rotation friction: Manual key updates across multiple systems
Principle of Least Privilege violation: All agents have access to all keys

Decision

Implement Vault credential resolution that:

Allows provider keys to reference HashiCorp Vault URIs instead of plaintext
Maintains backward compatibility with plaintext keys and auth.json
Uses fail-open semantics: if Vault unavailable, falls back to plaintext
Supports async resolution at runtime (no blocking on startup)
Keeps doctor checks synchronous (fast health check without HTTP calls)

URI Format

vault://secret/path/to/secret#fieldname

Examples:

ANTHROPIC_API_KEY=vault://secret/anthropic/prod#api_key
OPENAI_API_KEY=vault://secret/openai/prod#api_key
GROQ_API_KEY=vault://secret/groq/prod#api_key

Authentication Chain

In order of preference:

VAULT_ADDR and VAULT_TOKEN environment variables
~/.vault-token file (standard Vault client behavior)
AppRole (VAULT_ROLE_ID + VAULT_SECRET_ID) — reserved for future use
Fail open: if no auth method available, return plaintext URI

Resolution Chain for Provider Keys

When SF or pi-ai needs a provider credential:

Check environment variable (e.g., ANTHROPIC_API_KEY)
If value starts with vault://, call async resolver to fetch from Vault
If Vault unavailable, use URI string as plaintext (fail-open)
Otherwise, check auth.json
Return undefined if not found

Doctor Checks (Synchronous)

Health checks remain fast by:

Checking if env var exists AND is non-empty (doesn't matter if it's a URI)
If env var contains vault://, report "Vault" as source but don't resolve
Actual resolution happens later when credentials are used

Implementation

New Modules

vault-credential-resolver.js — Provider credential resolution with vault support

couldBeVaultUri(value) — Check if value looks like vault URI (no network I/O)
hasProviderCredentialEnvVar(envVarName) — Check if env var exists (no network I/O)
resolveProviderCredential(envValue) — Resolve vault URI to actual key (async)
resolveProviderCredentials(map) — Resolve multiple credentials (async)
getCredentialValue(result, strictMode) — Extract/validate resolved value
formatCredentialInfo(result, providerId) — Format for doctor output (masks value)

vault-resolver.js (existing) — Low-level vault client

parseVaultUri(uri) — Parse vault:// URIs
resolveVaultToken() — Resolve auth token from env/file/AppRole
resolveSecret(uri, opts) — Fetch secret from Vault with fail-open

Integration Points

doctor-providers.js — Updated to detect vault URIs
- resolveKey() now checks couldBeVaultUri() for vault:// URIs
- Reports "vault" as source for vault URIs (no blocking)
pi-ai getEnvApiKey() — No changes needed initially
- Returns vault:// URI as-is (callers must resolve async if needed)
- Future: add async variant getEnvApiKeyAsync() for direct vault support
pi-coding-agent resolve-config-value.ts — Already supports vault URIs
- resolveConfigValueAsync() handles vault:// URIs
- Used when pi-ai actually makes API calls
SF agent setup — Can initialize credential cache
- Pre-resolve commonly-used credentials at startup
- Cache with TTL (default 5 minutes, configurable)

Rationale

Why Fail-Open?

Vault may not be available in all environments (local dev, offline use)
Graceful degradation allows fallback to plaintext keys without blocking
Operator can choose strict mode if needed

Why Async?

Network I/O to Vault happens at credential usage time, not startup
Startup remains fast (doctor checks are synchronous)
Credentials can be refreshed by re-resolving throughout session

Why Not Modify pi-ai getEnvApiKey?

getEnvApiKey is sync; vault resolution is async
Cleaner separation: pi-ai doesn't know about vault
SF or pi-coding-agent handles async resolution at the point of use
Allows gradual migration: new code uses async, old code still works with plaintext

Vault KV v2 API

Vault path structure:

secret/                       # Mount point
├── anthropic/               # Provider
│   ├── prod                 # Environment/secret name
│   │   └── api_key          # Field in secret
│   └── dev
└── openai/
    ├── prod
    │   ├── api_key
    │   └── org_id
    └── staging

URI to fetch api_key from secret/anthropic/prod:

vault://secret/anthropic/prod#api_key

Query Patterns (Future)

With vault URIs persisted in config, audit/operations teams can:

-- Find all provider credentials using vault
SELECT provider_id, env_var_name, env_var_value FROM provider_config
WHERE env_var_value LIKE 'vault://%';

-- Reconstruct which services were using which vault secrets
SELECT config.provider_id, secrets.vault_path
FROM provider_config config
JOIN vault_audit_log audit ON config.env_var_value = audit.uri
JOIN vault_secrets secrets ON audit.secret_id = secrets.id;

Security Considerations

Token Storage: VAULT_TOKEN or ~/.vault-token must be protected (owner-only readable)
Network: Use HTTPS for Vault connections (VAULT_ADDR should be https://)
Audit: Enable Vault audit logging to track secret access
AppRole Rotation: Rotate VAULT_SECRET_ID regularly (future implementation)
Plaintext Fallback: Explicitly using fail-open means operators must be aware vault could be bypassed in edge cases

Backward Compatibility

Plaintext API keys continue to work unchanged
Existing auth.json credentials unaffected
No breaking changes to SF or pi-ai APIs
Doctor checks work exactly the same (just report vault as source when applicable)

Testing Strategy

Unit tests — Vault resolver with mocked fetch
- URI parsing (valid/invalid formats)
- Auth chain (env, file, AppRole not yet)
- Caching TTL
- Fail-open behavior
Integration tests (manual, requires Vault instance)
- End-to-end: set ANTHROPIC_API_KEY=vault://..., verify SF picks it up
- Auth chain: test each auth method (VAULT_TOKEN, ~/.vault-token)
- Doctor checks: verify "Vault" source reported without network I/O
Regression tests
- Plaintext keys still work
- auth.json still used as fallback
- No new test failures in existing suite

Future Work

AppRole support — For CI/CD without token files
Dynamic credentials — Use Vault to generate temporary DB/API credentials
Automated key rotation — Periodically fetch fresh credentials from Vault
Audit integration — Log which credentials were used (for compliance)
Multi-environment — Support vault://secret/anthropic/prod#api_key vs vault://secret/anthropic/staging#api_key per phase

7.6 KiB Raw Permalink Blame History