Secret Hygiene — Architecture Overview
The ops-infra runbook is the source of truth; this page is the orienting index. Full details:
ops-infra/runbooks/secret-hygiene-playbook.md
Why it matters
Every env var that reaches a running service must trace back to a line in a git-committed manifest or workflow file. When engineers set configuration out-of-band — via kubectl set env, a one-off SSH session, or an ad-hoc Compose file edit — the cluster drifts silently away from the repo. BL-223 surfaced this concretely: founderyos-api's running Deployment had 20+ env vars that existed nowhere in git. A naive kubectl apply -f deployment.yaml would have silently dropped every one of them, replacing the pod with whatever the repo actually declared and losing live configuration with no warning. Many services "tolerate" missing env at boot — they start fine, then fail at request time. That silent regression is the deploy-strip risk.
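The deploy-strip risk reduces to a set difference between the env var names on the live Deployment and the names the manifest declares. The sketch below is illustrative only: the function name and the sample variable names are hypothetical, and it is not the logic of audit-env-drift.sh.

```python
def vars_stripped_by_apply(live_env: dict, manifest_env: dict) -> list:
    """Return env var names set on the live Deployment but absent from the manifest.

    These are the names a plain `kubectl apply` of the manifest would drop.
    """
    return sorted(set(live_env) - set(manifest_env))


# Hypothetical example data; values are placeholders, not real configuration.
live = {"DATABASE_URL": "<redacted>", "SESSION_SECRET": "<redacted>", "LOG_LEVEL": "debug"}
manifest = {"LOG_LEVEL": "info"}

print(vars_stripped_by_apply(live, manifest))  # prints ['DATABASE_URL', 'SESSION_SECRET']
```

An empty result is the only safe state for an apply; any name in the list is configuration that exists nowhere in git.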
The second problem is exposure. Plaintext secrets on a live pod are visible to anyone with kubectl describe deploy or kubectl get deploy -o yaml. There is no git history explaining when a value was set, by whom, or why. Platform-wide, PLATFORM-009 standardized three patterns that close both gaps: every var is declared in the manifest (via valueFrom), rendered from GitHub Secrets/Variables at deploy time, and never typed directly onto a running container.
The three patterns
| Pattern | When to use | Storage | Manifest reference |
|---|---|---|---|
| A: k8s secretKeyRef / configMapKeyRef | Service runs as a k3s Deployment in the platform or founderyos namespace on AX42-U | Sensitive → GitHub repo Secret; config → GitHub repo Variable | valueFrom.secretKeyRef / valueFrom.configMapKeyRef |
| B: Static frontend bundle | Compiled SPA (Vite/Next.js) served by nginx with no Node runtime at request time | GitHub repo Variable (never Secret: VITE_* values ship in the JS bundle and are public) | No runtime manifest env block; values are baked in at build time via --build-arg |
| C: Docker Compose on a VPS | Service runs under docker compose on a standalone Hetzner VPS (today: oracle-bridge on 65.21.149.226) | GitHub repo Secret / Variable | .env.staging rendered by the deploy workflow and scp'd to the VPS; .env.* is gitignored, .env.*.example is committed |
Reference implementation: payment-gateway (PLATFORM-007.1 + BL-244) is the canonical Pattern A service. Its k8s/deployment.yaml and .github/workflows/deploy-staging.yml show the full render-then-reference cycle: the deploy workflow creates/updates the k8s Secret and ConfigMap with --dry-run=client -o yaml | kubectl apply, then the manifest references every value via valueFrom — no inline secret values, no kubectl set env, no drift.
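The render-then-reference cycle has two halves: a Secret object produced at deploy time, and container env entries that point at it by key. The sketch below shows the shape of both as plain Python dictionaries. It is not payment-gateway's actual workflow step (which, per the text above, goes through kubectl), and the Secret name payment-gateway-secrets and the key API_KEY are hypothetical.

```python
import base64
import json


def render_secret(name: str, namespace: str, values: dict) -> dict:
    """Build a k8s Secret object; values under `data` must be base64-encoded."""
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        "data": {
            key: base64.b64encode(value.encode()).decode()
            for key, value in sorted(values.items())
        },
    }


def env_refs(secret_name: str, keys: list) -> list:
    """Build container `env` entries that reference the Secret by key, never by value."""
    return [
        {"name": key, "valueFrom": {"secretKeyRef": {"name": secret_name, "key": key}}}
        for key in sorted(keys)
    ]


# Hypothetical names and a placeholder value, for illustration only.
secret = render_secret("payment-gateway-secrets", "platform", {"API_KEY": "example-value"})
print(json.dumps(secret, indent=2))
print(json.dumps(env_refs("payment-gateway-secrets", ["API_KEY"]), indent=2))
```

The point of the split is that only the first object ever contains a value, and it is generated at deploy time; the committed manifest carries only the second, which names a key and nothing else.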
Pattern B in practice: founderyos-dashboard (PLATFORM-009.3) completed its audit with manifest 0 / live 0 / drift 0 — the expected outcome for a static frontend. If you see a VITE_* variable configured as a k8s Secret, that is a smell: either it is not a real secret and should be a Variable, or it does not belong in the bundle at all.
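The smell described above can be checked mechanically. A minimal sketch, assuming you already have a list of secret names (for example, copied from a Secret's keys); the function name and sample names are hypothetical.

```python
def misplaced_vite_vars(secret_names: list, prefix: str = "VITE_") -> list:
    """Return names that carry the public build-time prefix but are stored as secrets."""
    return sorted(name for name in secret_names if name.startswith(prefix))


# Hypothetical secret names for illustration.
names = ["DATABASE_URL", "VITE_API_BASE_URL", "SESSION_SECRET"]
print(misplaced_vite_vars(names))  # prints ['VITE_API_BASE_URL']
```

Each hit needs a human decision: move it to a Variable if it is genuinely public, or remove it from the bundle if it is not.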
How to apply
New service: Pick the pattern that matches your runtime (see the decision tree in the playbook). Wire GitHub storage and the deploy-time render step before the first deploy. Do not defer secrets hygiene to a follow-up story.
Existing service with drift: Run ops-infra/scripts/audit-env-drift.sh <service> to enumerate every live-only var. Classify each one (sensitive → rotate before migration; config → ConfigMap or .env entry), then migrate per the playbook's migration checklist. Zero live-only vars is the acceptance criterion, not "the pod booted."
Rotating a secret: Follow the rotation procedure in the playbook. The short version: generate the new value → gh secret set → redeploy → verify the live pod carries the new value → deactivate the old value at the provider → append one line to bmad-artifacts/runbooks/key-rotation-log.md. For cross-service tokens (e.g. TOKEN_PAYMENT_GATEWAY shared between the k8s service-tokens Secret and oracle-bridge's .env.staging), rotate consumer side first, then caller, then drop the old token from the consumer; the playbook has the exact sequence.
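The consumer-first ordering for cross-service tokens exists so that the caller is never holding a token the consumer rejects. The toy model below illustrates that property. It assumes the consumer can accept the old and new token at the same time during the overlap (which "then drop the old token" implies); the class and function names are hypothetical, and the playbook remains the authority on the real sequence.

```python
from dataclasses import dataclass, field


@dataclass
class Consumer:
    """Service that validates incoming tokens against an accepted set."""
    accepted: set = field(default_factory=set)

    def authorize(self, token: str) -> bool:
        return token in self.accepted


@dataclass
class Caller:
    """Service that presents a single token on each request."""
    token: str


def rotate_shared_token(consumer: Consumer, caller: Caller, old: str, new: str) -> list:
    """Rotate in consumer, caller, cleanup order; record caller auth after each step."""
    results = []
    consumer.accepted.add(new)        # 1. consumer now accepts old and new
    results.append(consumer.authorize(caller.token))
    caller.token = new                # 2. caller switches to the new token
    results.append(consumer.authorize(caller.token))
    consumer.accepted.discard(old)    # 3. consumer drops the old token
    results.append(consumer.authorize(caller.token))
    return results


consumer = Consumer(accepted={"old-token"})
caller = Caller(token="old-token")
print(rotate_shared_token(consumer, caller, "old-token", "new-token"))  # prints [True, True, True]
```

Swapping steps 1 and 2 makes the second check fail, which is the outage the ordering is meant to avoid.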
Drift-detection cron
BL-252 added a nightly audit to catch new drift before it causes an incident.
- Workflow: ops-infra/.github/workflows/env-drift-audit.yml
- Schedule: 06:00 UTC daily + workflow_dispatch
The workflow clones all audit-target repos, SSHes to AX42-U and the oracle-bridge VPS, runs audit-env-drift.sh --all --env=staging, and compares the result against the newest dated baseline in bmad-artifacts/runbooks/env-drift-audit-YYYY-MM-DD.md. Any var that appears in the live set but not in the baseline opens (or comments on) a GitHub issue in ops-infra and emits a GlitchTip warning event. When new drift is found, the scheduled run exits non-zero so the failure surfaces immediately in the Actions UI.
When a story intentionally adds new env vars, regenerate the baseline as documented in that baseline report. The workflow picks up the newest dated file automatically — no workflow edit required.
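The comparison the cron performs can be sketched in two steps: pick the newest dated baseline by filename, then report anything live that the baseline does not list. This is a sketch under assumptions, not the workflow's actual code: the function names, the service/VAR entry format, and the second baseline filename are hypothetical.

```python
import re
from datetime import date
from typing import Optional

# Matches the dated baseline naming scheme env-drift-audit-YYYY-MM-DD.md.
BASELINE_RE = re.compile(r"env-drift-audit-(\d{4})-(\d{2})-(\d{2})\.md$")


def newest_baseline(filenames: list) -> Optional[str]:
    """Return the filename with the latest embedded date, or None if none match."""
    dated = []
    for name in filenames:
        match = BASELINE_RE.search(name)
        if match:
            year, month, day = (int(part) for part in match.groups())
            dated.append((date(year, month, day), name))
    return max(dated)[1] if dated else None


def new_drift(live_only: set, baseline: set) -> list:
    """Return live-only entries that the baseline has not already catalogued."""
    return sorted(live_only - baseline)


# Hypothetical inputs; entries are written as "service/VAR" for illustration.
files = ["env-drift-audit-2026-04-19.md", "env-drift-audit-2026-05-02.md", "README.md"]
print(newest_baseline(files))  # prints env-drift-audit-2026-05-02.md

drift = new_drift({"svc-a/FOO", "svc-a/BAR"}, {"svc-a/FOO"})
exit_code = 1 if drift else 0  # non-zero marks the scheduled run as failed
print(drift, exit_code)  # prints ['svc-a/BAR'] 1
```

Because the baseline is chosen by date in the filename, committing a newer baseline file is enough to acknowledge intentional additions.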
What NOT to do
- Never kubectl set env on a live Deployment outside of audit-env-drift.sh (which is read-only). Every out-of-band set creates exactly the drift PLATFORM-009 exists to eliminate.
- Never inline a secret value in a committed manifest, not even briefly: git history is permanent. Once a value lands in git it must be rotated, no exceptions.
- Never reuse staging secrets in production. Staging and production tokens, signing PEMs, and API keys must be distinct. Staging key separation is tracked as BL-206 (pre-prod cutover).
Cross-references
| Reference | Purpose |
|---|---|
| ops-infra/runbooks/secret-hygiene-playbook.md | Full playbook: patterns, migration checklist, rotation cadence table, emergency rotation procedure |
| oracle-bridge VPS | Pattern C worked example: .env.{staging,production} rendered from GH Secrets + scp'd to the VPS, loaded by docker compose |
| ops-infra/scripts/audit-env-drift.sh | Read-only drift detection script |
| ops-infra/.github/workflows/env-drift-audit.yml | Nightly drift-detection cron (BL-252) |
| bmad-artifacts/runbooks/env-drift-audit-2026-04-19.md | PLATFORM-009.1 baseline: catalogued 23 drifted vars across the fleet |
| bmad-artifacts/runbooks/platform-009-3-founderyos-dashboard-disposition.md | Pattern B disposition: founderyos-dashboard |
| bmad-artifacts/runbooks/platform-009-5-payment-gateway-notification-service-disposition.md | Pattern A disposition: payment-gateway + notification-service |
| ops-infra/runbooks/api-gateway-add-service.md | PLATFORM-006.4 service-tokens pattern (platform namespace) |
| system-topology.md | Cross-machine architecture overview: where each secret lives (VPS .env, k8s Secret namespaces, external provider dashboards) in the broader service map |
| BL-223 | Discovery story: 20+ drifted vars on founderyos-api; triggered PLATFORM-009 |
| BL-244 | payment-gateway ICP env reconciliation; Pattern A reference extension |
| BL-206 | Pre-prod key separation: staging vs production signing PEM rotation |